Correct ordering in the Zipf-Poisson ensemble
Justin S. Dyer, Art B. Owen

TL;DR
This paper analyzes the ordering of word frequencies modeled by a Zipf-Poisson ensemble, establishing probabilistic bounds for correct ordering of top-ranked words as the total count grows large.
Contribution
It provides explicit probabilistic bounds for the correct ordering of top elements in a Zipf-Poisson model, including practical estimates for real-world data like the British National Corpus.
Findings
The first n' words are correctly ordered relative to each other with probability tending to 1, for n' growing up to a rate of order $(N/\log N)^{1/(\alpha+2)}$.
The exact rate of $N^{1/(\alpha+2)}$ cannot be achieved.
For a Zipf--Poisson model of the British National Corpus, the 72 words with the highest counts are estimated to be properly ordered.
Abstract
We consider a Zipf--Poisson ensemble in which $X_i\sim\mathrm{Poi}(Ni^{-\alpha})$ for $\alpha>1$ and $N>0$ and integers $i\ge 1$. As $N\to\infty$ the first $n'(N)$ random variables have their proper order $X_1>X_2>\cdots>X_{n'}$ relative to each other, with probability tending to 1 for $n'$ up to $(AN/\log(N))^{1/(\alpha+2)}$ for an explicit constant $A(\alpha)\ge 3/4$. The rate $N^{1/(\alpha+2)}$ cannot be achieved. The ordering of the first $n'(N)$ entities does not preclude $X_m>X_{n'}$ for some interloping $m>n'$. The first $n''$ random variables are correctly ordered exclusive of any interlopers, with probability tending to 1 if $n''\le (BN/\log(N))^{1/(\alpha+2)}$ for $B<A$. For a Zipf--Poisson model of the British National Corpus, which has a total word count of $100{,}000{,}000$, our result estimates that the 72 words with the highest counts are properly ordered.
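To make the ordering claim concrete, the following is a minimal sketch and not the paper's method. It assumes the usual Zipf--Poisson form, independent counts with means N * i**(-alpha), and a threshold of the form (A*N/log N)**(1/(alpha+2)). It uses a normal approximation plus a union bound over adjacent pairs as a rough heuristic for the chance that the first n counts are not strictly decreasing. The function names and the values alpha = 1.1 and A = 0.75 are illustrative assumptions, not figures fitted to any corpus.

```python
import math


def zipf_poisson_means(N, alpha, n):
    """Assumed means lambda_i = N * i**(-alpha) for ranks i = 1..n."""
    return [N * i ** (-alpha) for i in range(1, n + 1)]


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def adjacent_misorder_bound(N, alpha, n):
    """Heuristic upper bound on P(first n counts are not strictly decreasing).

    X_1 > X_2 > ... > X_n holds exactly when every adjacent pair is ordered,
    so a union bound over adjacent pairs applies. Each pair probability is
    approximated by treating X_i - X_{i+1} as normal with mean
    lambda_i - lambda_{i+1} and variance lambda_i + lambda_{i+1}.
    """
    lam = zipf_poisson_means(N, alpha, n)
    total = 0.0
    for a, b in zip(lam, lam[1:]):
        z = (a - b) / math.sqrt(a + b)
        total += normal_cdf(-z)
    return min(1.0, total)


def rate_threshold(N, alpha, A):
    """Assumed threshold form (A * N / log N) ** (1 / (alpha + 2))."""
    return (A * N / math.log(N)) ** (1.0 / (alpha + 2.0))


if __name__ == "__main__":
    N, alpha, A = 1e8, 1.1, 0.75  # illustrative values only
    print("threshold:", rate_threshold(N, alpha, A))
    for n in (10, 50, 100, 200):
        print(n, adjacent_misorder_bound(N, alpha, n))
```

The bound shrinks as N grows for fixed n and grows with n for fixed N, which mirrors the qualitative statement that only an initial segment of the ranking can be trusted. It says nothing about interlopers from ranks beyond n.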
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling
