Autocorrelations Decay in Texts and Applicability Limits of Language Models
Nikolay Mikhaylovskiy, Ilya Churilov

TL;DR
This paper relates the power-law decay of autocorrelations in texts to the applicability limits of language models, arguing that models exhibiting Markov behavior, including large autoregressive language models, may struggle to analyze or generate long texts.
Contribution
The study empirically links autocorrelation decay in texts to the applicability limits of language models and highlights how generated texts differ from literary texts.
Findings
Autocorrelations of words in texts decay according to a power law.
Distributional semantics yields coherent decay exponents for texts translated into multiple languages.
Autocorrelation decay in generated texts differs quantitatively, and often qualitatively, from that in literary texts.
Abstract
We show that the laws of autocorrelations decay in texts are closely related to applicability limits of language models. Using distributional semantics we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics provides coherent autocorrelations decay exponents for texts translated to multiple languages. The autocorrelations decay in generated texts is quantitatively and often qualitatively different from the literary texts. We conclude that language models exhibiting Markov behavior, including large autoregressive language models, may have limitations when applied to long texts, whether analysis or generation.
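The measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `autocorrelation` and `fit_power_law` are hypothetical, and both the estimator (mean cosine similarity between centered word vectors a fixed number of positions apart) and the fitting procedure (least squares in log-log space) are assumptions chosen for simplicity. The input is assumed to be one embedding vector per word, in text order.

```python
import numpy as np


def autocorrelation(vectors, lags):
    """Mean cosine similarity between centered word vectors `lag` positions apart.

    `vectors` has shape (n_words, dim); every lag must satisfy 1 <= lag < n_words.
    This estimator is an assumption of this sketch, not the paper's definition.
    """
    v = np.asarray(vectors, dtype=float)
    v = v - v.mean(axis=0)  # remove the mean vector shared by all words
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    v = v / np.where(norms == 0, 1.0, norms)  # unit length; zero vectors stay zero
    return np.array([np.mean(np.sum(v[:-lag] * v[lag:], axis=1)) for lag in lags])


def fit_power_law(lags, corr):
    """Fit corr ~ a * lag**(-alpha) by linear regression in log-log space.

    Only strictly positive correlations are used, since the logarithm is
    undefined otherwise. Returns (alpha, a).
    """
    lags = np.asarray(lags, dtype=float)
    corr = np.asarray(corr, dtype=float)
    mask = corr > 0
    slope, intercept = np.polyfit(np.log(lags[mask]), np.log(corr[mask]), 1)
    return -slope, np.exp(intercept)
```

Under these assumptions, one would call `autocorrelation(embeddings, lags)` for a range of lags and pass the result to `fit_power_law`. A roughly straight line on a log-log plot would be consistent with power-law decay, while noticeable downward curvature would point to faster decay.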
Taxonomy
Topics: Topic Modeling · Opinion Dynamics and Social Influence · Computational and Text Analysis Methods
