On the role of autocorrelations in texts

D.V. Lande; A.A. Snarskii

arXiv:0710.0225 · cs.CL · October 2, 2007 · 1 citation

Open Access

TL;DR

This paper explores the use of autocorrelation-based data compression as a criterion to distinguish meaningful texts from arbitrary word sets, which could improve content indexing and signal separation.

Contribution

It introduces a novel criterion based on autocorrelations and data compression for identifying meaningful texts, expanding beyond traditional Zipf law approaches.

Findings

01

Autocorrelation measures can differentiate meaningful texts from random word sets.

02

Data compression effectiveness correlates with the presence of autocorrelations in texts.

03

The proposed criterion offers a new method for text analysis and classification.

Abstract

Finding a criterion that distinguishes a text from an arbitrary set of words is a relevant task in its own right, for instance for developing tools that index internet content or for separating signal from noise in communication channels. Zipf's law is currently considered the most reliable criterion of this kind [3]; at any rate, conventional stochastic word sets do not obey it. The present paper considers one possible criterion, based on determining the degree of data compression.
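As a rough illustration of a compression-based criterion, the sketch below compares how well a word sequence compresses in its original order and after its words are shuffled. The abstract does not specify the compression procedure used in the paper, so everything here is an assumption: `zlib` stands in as a general-purpose compressor, shuffling stands in for "an arbitrary set of words", and `compression_ratio`, `shuffled`, and `sample_words` are hypothetical names. Shuffling keeps word frequencies unchanged, so any difference in the ratio comes from word order alone.

```python
import random
import zlib


def compression_ratio(words):
    """Compressed size divided by raw size for the words joined by spaces."""
    data = " ".join(words).encode("utf-8")
    if not data:
        raise ValueError("need at least one word")
    return len(zlib.compress(data, 9)) / len(data)


def shuffled(words, seed=0):
    """Return a copy of the words in random order (same word frequencies)."""
    rng = random.Random(seed)
    out = list(words)
    rng.shuffle(out)
    return out


# Toy input, deliberately repetitive (an assumption for illustration only);
# a real experiment would use a long natural-language text.
sample_words = ("the quick brown fox jumps over the lazy dog " * 200).split()

ordered_ratio = compression_ratio(sample_words)
shuffled_ratio = compression_ratio(shuffled(sample_words))
print(f"ordered:  {ordered_ratio:.3f}")
print(f"shuffled: {shuffled_ratio:.3f}")
```

A lower ratio for the ordered sequence than for its shuffled copy suggests that word order carries structure the compressor can exploit. The toy input exaggerates this effect; on natural text the gap is likely to be much smaller and to depend on the compressor and the text length.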

Taxonomy

Topics: Algorithms and Data Compression · Cellular Automata and Applications · Semigroups and Automata Theory