On the role of autocorrelations in texts
D.V. Lande, A.A. Snarskii

TL;DR
This paper explores the degree of data compression, which reflects autocorrelations in a text, as a criterion for distinguishing meaningful texts from arbitrary word sets. Such a criterion could support content indexing and the separation of signal from noise.
Contribution
It introduces a criterion based on autocorrelations and data compression for identifying meaningful texts, going beyond the traditional approach based on Zipf's law.
Findings
Autocorrelation measures can differentiate meaningful texts from random word sets.
Data compression effectiveness correlates with the presence of autocorrelations in texts.
The proposed criterion offers a new method for text analysis and classification.
Abstract
The task of finding a criterion that distinguishes a text from an arbitrary set of words is relevant in its own right, for instance in developing tools for internet-content indexing or in separating signal from noise in communication channels. Zipf's law is currently considered the most reliable criterion of this kind [3]; at any rate, conventional stochastic word sets do not obey it. The present paper deals with one possible criterion, based on determining the degree of data compression.
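The abstract does not say which compressor or procedure the authors use, so the following is only a minimal sketch of the general idea, under stated assumptions: `zlib` stands in for the compressor, a random permutation of the same words stands in for the "arbitrary set of words", and the function names `compression_ratio` and `shuffled_baseline` are hypothetical. Shuffling keeps the word frequencies intact and destroys only the word order, so any gap between the two ratios comes from correlations between word positions.

```python
import random
import zlib


def compression_ratio(words):
    """Compressed size divided by raw size for the space-joined words."""
    raw = " ".join(words).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


def shuffled_baseline(words, trials=20, seed=0):
    """Mean compression ratio over random permutations of the same words.

    Assumption: a permutation is used as the 'arbitrary word set', so the
    vocabulary and word frequencies match the original text exactly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        permuted = words[:]
        rng.shuffle(permuted)
        total += compression_ratio(permuted)
    return total / trials


# Toy input only: a cumulative rhyme with strong long-range repetition.
sample = (
    "This is the house that Jack built. "
    "This is the malt that lay in the house that Jack built. "
    "This is the rat that ate the malt that lay in the house that Jack built. "
    "This is the cat that killed the rat that ate the malt "
    "that lay in the house that Jack built. "
    "This is the dog that worried the cat that killed the rat "
    "that ate the malt that lay in the house that Jack built. "
    "This is the cow with the crumpled horn that tossed the dog "
    "that worried the cat that killed the rat that ate the malt "
    "that lay in the house that Jack built."
)

words = sample.split()
print("ordered text  :", round(compression_ratio(words), 3))
print("shuffled words:", round(shuffled_baseline(words), 3))
```

A lower ratio for the ordered text than for the shuffled baseline is the signal such a criterion would look for. On short or weakly structured inputs the gap can be small, so a real test would need longer texts and a decision threshold, neither of which is specified here.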
Taxonomy
Topics: Algorithms and Data Compression · Cellular Automata and Applications · Semigroups and Automata Theory
