Maximal Repetition and Zero Entropy Rate
Łukasz Dębowski

TL;DR
This paper studies how fast maximal repetition grows in strings drawn from stochastic processes. It bounds the growth rate in terms of entropy and shows that the growth observed in natural language is incompatible with typical hidden Markov models.
Contribution
It gives two new entropy-based bounds on the almost sure growth rate of maximal repetition and shows that natural language cannot be modeled by standard hidden Markov processes.
Findings
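The almost sure growth rate of maximal repetition is bounded above in terms of conditional Rényi entropy and below in terms of unconditional Shannon entropy.
The growth of maximal repetition observed in natural-language texts is incompatible with hidden Markov models.
Power-law logarithmic growth of maximal repetition implies a zero conditional Rényi entropy rate.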
Abstract
Maximal repetition of a string is the maximal length of a repeated substring. This paper investigates maximal repetition of strings drawn from stochastic processes. Strengthening previous results, two new bounds for the almost sure growth rate of maximal repetition are identified: an upper bound in terms of conditional Rényi entropy of order $\gamma > 1$ given a sufficiently long past and a lower bound in terms of unconditional Shannon entropy (of order $\gamma = 1$). Both the upper and the lower bound can be proved using an inequality for the distribution of recurrence time. We also supply an alternative proof of the lower bound which makes use of an inequality for the expectation of subword complexity. In particular, it is shown that a power-law logarithmic growth of maximal repetition with respect to the string length, recently observed for texts in natural language, may hold only if the…
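To make the definition in the abstract concrete, here is a minimal Python sketch that computes the maximal repetition of a string. It is not the paper's method. The function names `has_repeat` and `maximal_repetition` are hypothetical, and the sketch assumes that two occurrences of a repeated substring may overlap, so that "aaaa" has maximal repetition 3.

```python
def has_repeat(text: str, k: int) -> bool:
    """Return True if some substring of length k occurs at two different positions."""
    seen = set()
    for i in range(len(text) - k + 1):
        sub = text[i:i + k]
        if sub in seen:
            return True
        seen.add(sub)
    return False


def maximal_repetition(text: str) -> int:
    """Length of the longest substring occurring at least twice (overlaps allowed).

    Assumption: overlapping occurrences count as a repeat.
    If a substring of length k repeats, so does one of length k - 1,
    so a binary search over k is valid.
    """
    lo, hi = 0, max(len(text) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(text, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    for sample in ["abcd", "abcabc", "banana", "aaaa"]:
        print(sample, maximal_repetition(sample))
```

The sketch stores every substring of the tested length, so it is suitable only for short strings. For long texts a suffix array or suffix automaton would be the usual choice.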
