On the nature of long-range letter correlations in texts
Dmitrii Y. Manin

TL;DR
This paper traces long-range letter correlations in texts to slow variations in lexical composition and shows that they survive random letter shuffling within a moving window, so they reflect text structure only indirectly.
Contribution
It uses random walk analysis and Jensen-Shannon divergence to identify the source of long-range letter correlations in natural texts.
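As a rough illustration of random walk analysis on text, the sketch below maps each letter to a step of +1 or -1, accumulates the steps into a walk, and measures the standard deviation of the walk's displacement over a given lag. The vowel/non-vowel mapping and the function name random_walk_fluctuation are assumptions made for this example; the summary does not say which mapping the paper uses. For an uncorrelated sequence the fluctuation grows roughly as the square root of the lag, so faster growth hints at long-range correlations.

```python
import math

def random_walk_fluctuation(text, lag, is_up=lambda ch: ch in "aeiou"):
    """Standard deviation of walk displacement over `lag` steps.

    Assumption for illustration: vowels step +1, all other letters step -1.
    Non-letter characters are skipped.
    """
    steps = [1 if is_up(ch) else -1 for ch in text.lower() if ch.isalpha()]
    # Cumulative sum of steps, starting at 0.
    walk = [0]
    for s in steps:
        walk.append(walk[-1] + s)
    if lag < 1 or lag >= len(walk):
        raise ValueError("lag must be between 1 and the number of letters")
    # Displacements over every window of length `lag`.
    diffs = [walk[i + lag] - walk[i] for i in range(len(walk) - lag)]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(var)
```

Evaluating this for a range of lags and plotting the result on log-log axes is one common way to estimate a scaling exponent; that step is left out here.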
Findings
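Correlations stem from slow variations in letter frequency distribution, which in turn follow from slow variations in lexical composition.
Random letter shuffling within a moving window preserves the correlations.
Correlations reflect structural properties of the text, but only very indirectly.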
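The shuffling finding can be illustrated with a minimal sketch. It assumes one simple reading of "shuffling within a moving window": the text is cut into consecutive non-overlapping blocks and the letters inside each block are permuted at random. The paper's actual procedure may differ, and the function name shuffle_within_windows and its parameters are hypothetical.

```python
import random

def shuffle_within_windows(text, window, seed=0):
    """Randomly permute characters inside consecutive blocks of `window` characters.

    Assumption for illustration: non-overlapping blocks stand in for the
    paper's moving window. Local letter frequencies are left unchanged.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = random.Random(seed)  # seeded so the result is reproducible
    chars = list(text)
    for start in range(0, len(chars), window):
        block = chars[start:start + window]
        rng.shuffle(block)
        chars[start:start + window] = block
    return "".join(chars)
```

Because each block keeps exactly the same letters, word order and spelling are destroyed inside a block while the slowly varying letter frequencies across blocks survive, which is the property the finding relies on.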
Abstract
The origin of long-range letter correlations in natural texts is studied using random walk analysis and Jensen-Shannon divergence. It is concluded that they result from slow variations in letter frequency distribution, which are a consequence of slow variations in lexical composition within the text. These correlations are preserved by random letter shuffling within a moving window. As such, they do reflect structural properties of the text, but in a very indirect manner.
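To make the role of Jensen-Shannon divergence concrete, the following sketch compares the letter frequency distributions of two text segments. It uses the standard definition JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2 with M = (P + Q) / 2 and entropy in bits, so the value lies between 0 and 1. How the paper segments the text and which characters it counts are not stated in this summary; the helper names and the letters-only filter are assumptions.

```python
import math
from collections import Counter

def letter_distribution(segment):
    """Relative frequencies of alphabetic characters (case-folded) in a segment."""
    counts = Counter(ch for ch in segment.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: c / total for ch, c in counts.items()}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jensen_shannon_divergence(p, q):
    """JSD between two distributions given as dicts mapping symbol to probability."""
    keys = set(p) | set(q)
    # Equal-weight mixture of the two distributions.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))
```

Applied to pairs of segments taken from different parts of a text, a divergence that stays clearly above what two random samples of the same overall letter distribution would give is one sign of the slow variations in letter frequency described above.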
Taxonomy
Topics: Fractal and DNA sequence analysis · Machine Learning in Bioinformatics
