Distinct word length frequencies: distributions and symbol entropies

Reginald D. Smith

arXiv:1207.2334 · cs.CL · July 17, 2012 · 31 cites

Open Access

TL;DR

This paper analyzes the distribution of distinct word lengths in languages using empirical data and information theory, deriving models for word frequency, mean length, and entropy-based estimates.

Contribution

The paper introduces two methods, empirical distribution analysis and entropy-based modeling, to estimate word length frequencies and related statistics in languages.

Findings

01. Derived a distribution explaining the number of distinct words by length

02. Estimated mean word length and variance from letter and space probabilities

03. Demonstrated that conditional entropies can estimate the frequency of distinct words of a given length, and that vocabulary word length can be used to estimate higher order entropies
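
As a rough illustration of the kind of quantity the entropy-based finding relies on, the following is a minimal sketch of a plug-in estimate of the first-order conditional entropy of characters, H(X_n | X_{n-1}), from bigram counts. The function name `conditional_entropy` and the choice of character bigrams are assumptions made for illustration; how the paper maps such entropies onto counts of distinct words is not described on this page and is not reproduced here.

```python
from collections import Counter
from math import log2


def conditional_entropy(text):
    """Plug-in estimate of H(X_n | X_{n-1}) in bits from character bigrams.

    Hypothetical helper for illustration; not the paper's estimator.
    """
    # Count adjacent character pairs and the characters that start a pair.
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = sum(pairs.values())

    h = 0.0
    for (a, _b), n in pairs.items():
        p_pair = n / total          # joint probability of the bigram
        p_cond = n / firsts[a]      # probability of the second char given the first
        h -= p_pair * log2(p_cond)
    return h


# A strictly alternating string is fully predictable from the previous character.
h_alternating = conditional_entropy("abababab")
h_mixed = conditional_entropy("aabb")
```

On real corpora a plug-in estimate like this is biased downward for small samples, so it is best read as a starting point rather than a calibrated measurement.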

Abstract

The distribution of frequency counts of distinct words by length in a language's vocabulary will be analyzed using two methods. The first will look at the empirical distributions of several languages and derive a distribution that reasonably explains the number of distinct words as a function of length. We will be able to derive the frequency count, mean word length, and variance of word length based on the marginal probability of letters and spaces. The second, based on information theory, will demonstrate that conditional entropies can also be used to estimate the frequency of distinct words of a given length in a language. In addition, it will be shown how these techniques can be applied to estimate higher order entropies using vocabulary word length.
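
The empirical side of the abstract starts from counts of distinct words (vocabulary entries, not running tokens) grouped by length. A minimal sketch of that tabulation, together with the mean and variance of word length over the vocabulary, might look as follows. The tokenization (lowercased runs of ASCII letters) and the helper names are assumptions for illustration only; the paper's fitted distribution and its derivation from letter and space probabilities are not reproduced here.

```python
from collections import Counter
import re


def distinct_word_length_counts(text):
    """Count distinct words in `text` by their length in letters."""
    # Assumed tokenization: lowercase, keep runs of a-z only.
    vocabulary = set(re.findall(r"[a-z]+", text.lower()))
    return Counter(len(word) for word in vocabulary)


def mean_and_variance(counts):
    """Mean and population variance of word length over the vocabulary."""
    total = sum(counts.values())
    mean = sum(length * n for length, n in counts.items()) / total
    variance = sum(n * (length - mean) ** 2 for length, n in counts.items()) / total
    return mean, variance


sample = "the cat sat on the mat and the dog sat on the log"
counts = distinct_word_length_counts(sample)
mean, variance = mean_and_variance(counts)
```

Repeated tokens such as "the" contribute once, since the statistic of interest is the number of distinct words at each length rather than how often each word occurs.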

Taxonomy

Topics: Fractal and DNA sequence analysis · Machine Learning in Bioinformatics · Artificial Immune Systems Applications