A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences
Marcelo A. Montemurro, Mirko Degli Esposti

TL;DR
This paper introduces a surrogate model for symbolic sequences that simultaneously preserves word frequency distributions and long-range correlation structures, aiding the analysis of complex systems like language and DNA.
Contribution
The authors develop a surrogate generation method that maintains both the empirical symbol frequencies and the long-range correlations of a sequence, whereas existing surrogate models usually preserve only one of the two.
Findings
The surrogate reproduces Zipf's law and long-range correlations in language and DNA sequences.
The model is validated on English and Latin texts, as well as on genomic DNA.
The method provides a tool for analyzing the structural features and origins of scaling laws in symbolic data.
Abstract
Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian…
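To make the two constraints concrete, here is a minimal Python sketch, not the authors' algorithm: the abstract is truncated before the construction is described. It assumes first-order DFA, an approximate long-memory Gaussian driver built by Fourier filtering, and a rank-remapping step that permutes the original symbols so their order follows the driver. The function names `dfa_exponent`, `power_law_noise`, and `rank_remap_surrogate`, the window sizes, and the use of the symbols' sort order are all hypothetical choices made for illustration.

```python
import numpy as np


def dfa_exponent(x, scales=None):
    """Estimate the first-order DFA scaling exponent of a 1-D numeric series."""
    x = np.asarray(x, dtype=float)
    n = x.size
    profile = np.cumsum(x - x.mean())  # integrated, mean-removed series
    if scales is None:
        # Log-spaced window sizes from 8 to n/4 (an arbitrary, assumed choice).
        scales = np.unique(
            np.logspace(np.log10(8), np.log10(n // 4), 20).astype(int)
        )
    fluct = []
    t_cache = {}
    for s in scales:
        n_win = n // s
        segs = profile[: n_win * s].reshape(n_win, s).T  # shape (s, n_win)
        t = t_cache.setdefault(s, np.arange(s))
        # Linear least-squares fit in every window at once.
        slope, intercept = np.polyfit(t, segs, 1)
        resid = segs - (np.outer(t, slope) + intercept)
        fluct.append(np.sqrt(np.mean(resid ** 2)))
    alpha, _ = np.polyfit(np.log(scales), np.log(fluct), 1)
    return alpha


def power_law_noise(n, beta, rng):
    """Approximate Gaussian noise with power spectrum ~ 1/f**beta.

    Fourier filtering is an assumed stand-in for a proper fractional
    Gaussian noise generator.
    """
    freqs = np.fft.rfftfreq(n)
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-beta / 2.0)  # zero the DC component
    phases = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    return np.fft.irfft(amp * phases, n)


def rank_remap_surrogate(symbols, driver):
    """Permute `symbols` so that their sorted order follows the ranks of `driver`.

    The output is a permutation of the input, so every symbol count is
    preserved exactly. Ordering symbols by their sort order is an assumption.
    """
    symbols = np.asarray(symbols)
    driver = np.asarray(driver, dtype=float)
    out = np.empty_like(symbols)
    out[np.argsort(driver, kind="stable")] = np.sort(symbols)
    return out
```

For a two-symbol sequence coded as -1 and +1 (as in a purine-pyrimidine mapping), calling `rank_remap_surrogate(symbols, power_law_noise(len(symbols), 0.6, rng))` yields a sequence with the same symbol counts as `symbols` whose DFA exponent can be compared against that of the original and of a plain shuffle.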
Taxonomy
Topics: Fractal and DNA sequence analysis · Language and cultural evolution · Complex Systems and Time Series Analysis
