The Dependence of Frequency Distributions on Multiple Meanings of Words, Codes and Signs
Xiaoyong Yan, Petter Minnhagen

TL;DR
This paper investigates how multiple meanings of words shape frequency distributions in texts. Words are coded by their first few letters, the resulting increase in meanings per coded word is measured and fed into a predictive theory, and English texts are compared with texts written in Chinese characters.
Contribution
It introduces a predictive theory linking word coding and multiple meanings to frequency distribution shapes across languages.
Findings
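The word-frequency distribution of an English text is broad and fat-tailed.
Representing each word by only its first letter makes the distribution exponential; the theory also predicts the intermediate sequence for the first L=6,5,4,3,2,1 letters.
Frequency distributions of texts written in Chinese characters resemble those of the same texts written in letter-codes, which is interpreted as a consequence of the multiple meanings of Chinese characters.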
Abstract
The dependence of the frequency distributions on multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters, the number of meanings per coded word increases. This increase is measured and used as an input in a predictive theory. For a text written in English, the word-frequency distribution is broad and fat-tailed, whereas if the words are only represented by their first letter the distribution becomes exponential. Both distributions are well predicted by the theory, as is the whole sequence obtained by consecutively representing the words by the first L=6,5,4,3,2,1 letters. Comparisons of texts written in Chinese characters and the same texts written in letter-codes are made, and the similarity of the corresponding frequency distributions is interpreted as a consequence of the multiple meanings of Chinese characters. This further…
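The letter-deletion procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`code_words`, `frequency_distribution`, `meanings_per_code`), the toy sentence, and the use of "distinct original words per code" as a stand-in for "meanings per coded word" are all assumptions made for illustration, and the predictive theory itself is not implemented here.

```python
from collections import Counter

def code_words(words, L):
    """Represent each word by its first L letters (shorter words are kept whole)."""
    return [w[:L] for w in words]

def frequency_distribution(tokens):
    """Map k -> number of distinct tokens that occur exactly k times."""
    occurrences = Counter(tokens)
    return Counter(occurrences.values())

def meanings_per_code(words, L):
    """Average number of distinct original words sharing one code.

    Assumption: this is used as a simple proxy for the 'number of
    meanings per coded word' mentioned in the abstract.
    """
    groups = {}
    for w in set(words):
        groups.setdefault(w[:L], set()).add(w)
    return sum(len(g) for g in groups.values()) / len(groups)

# Toy text (hypothetical); a real study would use a full novel-length text.
text = "the cat sat on the mat and the cow ate the corn"
words = text.split()

for L in (6, 5, 4, 3, 2, 1):
    coded = code_words(words, L)
    print(L, len(set(coded)), meanings_per_code(words, L),
          dict(frequency_distribution(coded)))
```

On a realistic corpus, printing the distribution for each L should show the number of distinct codes shrinking and the average number of words per code growing as L decreases, which is the quantity the abstract says is measured and used as input to the theory.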
