Complex Systems

Information Approach to Co-occurrence of Words in Written Language

Damián G. Hernández

Consejo Nacional de Investigaciones Científicas y Técnicas
Centro Atómico Bariloche and Instituto Balseiro
8400 Bariloche, Río Negro, Argentina

Abstract

In this paper we study the distribution of words across the different parts of a book using tools from information theory. In particular, the mutual information between the words and the parts of the text is compared with the mutual information obtained from a shuffled version of the book. This analysis allows us to extract not only the relevant words of the text but also relationships between words, such as co-occurrence and repulsion. Using the connections due to co-occurrence, we show how to construct a network that reflects the semantic organization of the book. The method can also be applied to other types of sequences, measuring the relations between the symbols that compose them.
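The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the book is already tokenized into a list of words and is split into contiguous parts of nearly equal length, and it uses a plain plug-in (frequency-count) estimate of the mutual information. The function names `mutual_information` and `shuffled_baseline`, the number of shuffles, and the equal-length partition are illustrative assumptions, not details taken from the paper.

```python
import math
import random
from collections import Counter


def mutual_information(tokens, n_parts):
    """Plug-in estimate (in bits) of the mutual information between
    word identity and part index, for a text cut into n_parts
    contiguous parts of nearly equal length (an assumed partition)."""
    n = len(tokens)
    # Part index assigned to each token position.
    parts = [i * n_parts // n for i in range(n)]
    joint = Counter(zip(tokens, parts))
    word_counts = Counter(tokens)
    part_counts = Counter(parts)
    mi = 0.0
    for (word, part), count in joint.items():
        # p(w,p) * log2( p(w,p) / (p(w) * p(p)) ), written with raw counts.
        ratio = count * n / (word_counts[word] * part_counts[part])
        mi += (count / n) * math.log2(ratio)
    return mi


def shuffled_baseline(tokens, n_parts, n_shuffles=20, seed=0):
    """Average mutual information over random shufflings of the tokens.
    Shuffling keeps word frequencies but destroys their placement."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += mutual_information(shuffled, n_parts)
    return total / n_shuffles
```

Under these assumptions, the difference `mutual_information(tokens, n_parts) - shuffled_baseline(tokens, n_parts)` estimates how much of the word-part information comes from the ordering of the text rather than from finite-sample fluctuations. For a toy text in which one word fills the first half and another fills the second half, two parts carry exactly one bit of information about the word.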

https://doi.org/10.25088/ComplexSystems.24.2.127