Size dependent word frequencies and translational invariance of books
Sebastian Bernhardsson, Luis Enrique Correa da Rocha, Petter Minnhagen

TL;DR
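This paper shows that the word-frequency distribution of a novel depends on the length of the text section analyzed, and that novels exhibit a translational invariance shared with a null model in which words are randomly distributed. These results rule out text-evolution models such as the Simon model, and the size dependence is well described by a specific Random Book Transformation.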
Contribution
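It demonstrates that real novels share key statistical features with a null model in which the words are randomly distributed throughout the text, and it introduces a size transformation, the Random Book Transformation, that enables a more precise determination of the functional form of the word-frequency distribution.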
Findings
Word-frequency distribution depends on text length
Translational invariance is observed in novels
Size transformation can be modeled by a specific Random Book Transformation
Abstract
It is shown that a real novel shares many characteristic features with a null model in which the words are randomly distributed throughout the text. Such a common feature is a certain translational invariance of the text. Another is that the functional form of the word-frequency distribution of a novel depends on the length of the text in the same way as the null model. This means that an approximate power-law tail ascribed to the data will have an exponent which changes with the size of the text-section which is analyzed. A further consequence is that a novel cannot be described by text-evolution models like the Simon model. The size-transformation of a novel is found to be well described by a specific Random Book Transformation. This size transformation in addition enables a more precise determination of the functional form of the word-frequency distribution. The implications of the…
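The null model in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, and `thinned_histogram` assumes that each occurrence of a word lands in a 1/n section independently with probability 1/n (binomial thinning), which approximates random sampling without replacement for long texts. This summary does not give the paper's own definition of the Random Book Transformation, so the sketch should not be read as that definition.

```python
import random
from collections import Counter
from math import comb


def frequency_histogram(tokens):
    """Map k -> number of distinct words that occur exactly k times."""
    word_counts = Counter(tokens)
    return dict(Counter(word_counts.values()))


def random_section(tokens, n, rng):
    """Null model: shuffle the words, then keep the first 1/n of the text."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled[: len(shuffled) // n]


def thinned_histogram(hist, n):
    """Expected histogram of a 1/n section under binomial thinning.

    Assumption (not from the source): every occurrence of a word is kept
    independently with probability 1/n. Words that end up with zero
    occurrences drop out of the section's vocabulary, so k starts at 1.
    """
    p = 1.0 / n
    expected = {}
    for k_full, n_words in hist.items():
        for k in range(1, k_full + 1):
            weight = comb(k_full, k) * p**k * (1 - p) ** (k_full - k)
            expected[k] = expected.get(k, 0.0) + n_words * weight
    return expected
```

To try it, one could tokenize a novel, compare `frequency_histogram(random_section(tokens, n, random.Random(0)))` with `frequency_histogram` of a contiguous section of the same length, and check both against `thinned_histogram(frequency_histogram(tokens), n)`. Under the translational invariance described in the abstract, the shuffled and contiguous sections would be expected to give similar histograms.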
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Authorship Attribution and Profiling
