Co-occurrence of the Benford-like and Zipf Laws Arising from the Texts Representing Human and Artificial Languages
Evgeny Shulzinger, Irina Legchenkova, Edward Bormashenko

TL;DR
Large texts in human languages (English, Russian, Ukrainian) and artificial languages (C++, Java) follow both a Benford-like law and the Zipf law. The moduli of the slopes of the double-logarithmic plots are markedly larger for the artificial languages than for the human ones.
Contribution
It demonstrates that the Benford-like and Zipf laws co-occur in large texts in both human and artificial languages, and that the two laws respond differently when the most popular words are excluded.
Findings
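Word frequency follows the Zipf law: it is inversely proportional to rank, and excluding the most popular words improves the agreement with the Zipfian distribution.
The Benford-like distribution of the leading numbers of word counts is insensitive to excluding the most popular words.
Artificial languages (C++, Java) show markedly larger moduli of slopes in double-logarithmic plots than human languages.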
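For orientation, the two laws named in these findings can be written in their textbook forms. The symbols below ($f$, $r$, $C$, $\alpha$, $P$, $d$) are notation chosen here for illustration and are not taken from the paper. The paper calls its distribution "Benford-like", so the classical Benford formula should be read as a reference shape rather than as the exact distribution the authors report.

```latex
% Zipf law: the frequency f of a word falls as a power of its rank r.
% Exact inverse proportionality corresponds to \alpha = 1.
f(r) = \frac{C}{r^{\alpha}}
\quad\Longrightarrow\quad
\log f(r) = \log C - \alpha \log r

% The double-logarithmic plot is therefore a straight line with slope -\alpha,
% and the "modulus of the slope" is |-\alpha| = \alpha.

% Classical Benford law for the leading digit d of a count:
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9
```

Under the classical form, a leading 1 is expected about 30.1% of the time and a leading 9 about 4.6% of the time, which is the unevenness that the findings refer to.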
Abstract
We demonstrate that large texts, representing human (English, Russian, Ukrainian) and artificial (C++, Java) languages, display quantitative patterns characterized by the Benford-like and Zipf laws. The frequency of a word following the Zipf law is inversely proportional to its rank, whereas the total number of occurrences of each word in the text generates the uneven Benford-like distribution of leading numbers. Excluding the most popular words substantially improves the correlation of actual textual data with the Zipfian distribution, whereas the Benford distribution of leading numbers (arising from the overall count of each word) is insensitive to the same elimination procedure. The calculated values of the moduli of slopes of double-logarithmic plots for artificial languages (C++, Java) are markedly larger than those for human ones.
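The measurements described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the tokenizer, the function names (`word_counts`, `leading_digit_distribution`, `benford`, `zipf_slope`), the `skip_top` parameter, and the choice to keep original ranks after excluding the most popular words are all hypothetical. Tokenizing C++ or Java source would need a different rule than the letter-only one used here.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens (letters only; a simplifying assumption)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def leading_digit_distribution(counts):
    """Share of distinct words whose occurrence count starts with digit 1..9."""
    digits = Counter(int(str(n)[0]) for n in counts.values())
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}


def benford(d):
    """Classical Benford probability of leading digit d, for comparison."""
    return math.log10(1 + 1 / d)


def zipf_slope(counts, skip_top=0):
    """Least-squares slope of log(frequency) versus log(rank).

    skip_top drops the most frequent words; the remaining words keep
    their original ranks (an assumption made for this sketch).
    """
    freqs = sorted(counts.values(), reverse=True)
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(freqs, start=1)
        if rank > skip_top
    ]
    if len(points) < 2:
        raise ValueError("need at least two ranked words to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx
```

For a text loaded into a string `text`, `abs(zipf_slope(word_counts(text)))` gives the modulus of the slope, and `leading_digit_distribution(word_counts(text))` can be compared digit by digit with `benford(d)`. Calling `zipf_slope` with a nonzero `skip_top` mimics the exclusion of the most popular words.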
Taxonomy
Topics: Benford’s Law and Fraud Detection · Authorship Attribution and Profiling · Complex Systems and Time Series Analysis
