N-tuple Zipf Analysis and Modeling for Language, Computer Program and DNA
Xiaocong Gan, Dahui Wang, Zhangang Han

TL;DR
This paper introduces a simple preferential selection model based on random copy-paste processes to explain the n-tuple power law observed in language, DNA, and computer code, supported by empirical data and simulations.
Contribution
It proposes a novel, simple model inspired by Simon's model that reproduces n-tuple Zipf laws and DNA symmetry breaking, validated by empirical data and simulations.
Findings
Model reproduces n-tuple power law in simulated data.
Estimation equations match empirical Zipf exponents.
Captures DNA symmetry breaking process.
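As a rough illustration of what an n-tuple Zipf analysis involves, the sketch below counts overlapping n-tuples in a sequence and fits a straight line to the log-log rank-frequency curve. The function names `ntuple_rank_frequency` and `zipf_slope` are hypothetical, and the ordinary least-squares fit is an assumed stand-in: the paper's own estimation equations are not given on this page and are not reproduced here.

```python
from collections import Counter
import math


def ntuple_rank_frequency(seq, n):
    """Count overlapping n-tuples and return their counts, most frequent first."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a plain log-log regression, not the paper's estimator.
    A pure Zipf law f(r) = C / r**a gives a slope of -a.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For example, `zipf_slope(ntuple_rank_frequency(text, 3))` would give a crude slope estimate for 3-tuples of a string `text`; a regression over all ranks is sensitive to the low-frequency tail, so the value should be read as indicative only.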
Abstract
The n-tuple power law is widespread in language, computer program code, DNA and music. After extensive Zipf analyses of the n-tuple power law in empirical data, we propose a model to explain the n-tuple power-law feature found in these information translational carriers. Our model is a preferential selection approach inspired by Simon's model, which explained the scaling law of single symbols in the Zipf analysis of a sequence. The kernel mechanism of our model is simple: it is a random copy-and-paste process that repeatedly selects a random segment from the current sequence and attaches it to the end. Simulations show that the n-tuple power law exists in model-generated data. Furthermore, two estimation equations, one for the Zipf exponent and one for the minimal n-tuple length at which the power law appears, both correspond well to empirical data. Our model can…
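The copy-and-paste mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `copy_paste_sequence`, the uniform choice of segment start and length, the seed string, and the truncation to a target length are all assumptions, since the abstract only says that a random segment of the current sequence is appended repeatedly.

```python
import random


def copy_paste_sequence(seed, target_length, rng=None):
    """Grow a sequence by repeatedly appending a random segment of itself.

    Assumptions (not specified in the abstract): the segment start and
    length are drawn uniformly, and the result is cut to target_length.
    """
    rng = rng or random.Random()
    seq = list(seed)
    if not seq:
        raise ValueError("seed must contain at least one symbol")
    while len(seq) < target_length:
        start = rng.randrange(len(seq))               # uniform start position
        length = rng.randint(1, len(seq) - start)     # segment stays inside seq
        seq.extend(seq[start:start + length])         # paste the copy at the end
    return "".join(seq[:target_length])


if __name__ == "__main__":
    sample = copy_paste_sequence("ACGT", 60, random.Random(0))
    print(sample)
```

Because every appended segment is a copy of existing content, n-tuples that already occur often are more likely to be copied again, which is the preferential selection effect the abstract attributes to the model.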
Taxonomy
Topics: RNA and protein synthesis mechanisms · Fractal and DNA sequence analysis · DNA and Biological Computing
