Correct ordering in the Zipf-Poisson ensemble

Justin S. Dyer; Art B. Owen

arXiv:1101.2481 · stat.ME · January 14, 2011

TL;DR

This paper analyzes the ordering of word frequencies modeled by a Zipf-Poisson ensemble, establishing probabilistic bounds for correct ordering of top-ranked words as the total count grows large.

Contribution

It provides explicit probabilistic bounds for the correct ordering of top elements in a Zipf-Poisson model, including practical estimates for real-world data like the British National Corpus.

Findings

01

The first $n'$ words are correctly ordered relative to each other, with probability tending to 1 as $N \to \infty$, for $n'$ up to $(AN/\log(N))^{1/(\alpha+2)}$.

02

The exact rate $N^{1/(\alpha+2)}$ cannot be achieved.

03

For a Zipf-Poisson model of the British National Corpus (total word count 100,000,000), the result estimates that the 72 words with the highest counts are properly ordered.

Abstract

We consider a Zipf--Poisson ensemble in which $X_i \sim \mathrm{Poi}(N i^{-\alpha})$ for $\alpha > 1$ and $N > 0$ and integers $i \geq 1$. As $N \to \infty$ the first $n'(N)$ random variables have their proper order $X_1 > X_2 > \cdots > X_{n'}$ relative to each other, with probability tending to 1 for $n'$ up to $(AN/\log(N))^{1/(\alpha+2)}$ for an explicit constant $A(\alpha) \geq 3/4$. The rate $N^{1/(\alpha+2)}$ cannot be achieved. The ordering of the first $n'(N)$ entities does not preclude $X_m > X_{n'}$ for some interloping $m > n'$. The first $n''$ random variables are correctly ordered exclusive of any interlopers, with probability tending to 1 if $n'' \leq (BN/\log(N))^{1/(\alpha+2)}$ for $B < A$. For a Zipf--Poisson model of the British National Corpus, which has a total word count of $100{,}000{,}000$, our result estimates that the 72 words with the highest counts are properly ordered.
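
The ensemble in the abstract is easy to simulate, which gives a rough feel for how the probability of a correct ordering falls as more ranks are included. The following is a minimal Monte Carlo sketch, not the paper's method. The function names, the trial count, the seed, and the example values of `N` and `alpha` are all illustrative assumptions. `rate_threshold` uses `A = 0.75`, which is only the lower bound on $A(\alpha)$ stated in the abstract, not the paper's explicit constant. The check covers the relative order of the first `n` variables only and ignores interlopers.

```python
import numpy as np


def prob_correct_order(N, alpha, n, trials=2000, seed=0):
    """Monte Carlo estimate of P(X_1 > X_2 > ... > X_n), where the X_i are
    independent Poisson variables with means N * i**(-alpha).

    Interlopers (X_m > X_n for some m > n) are not considered here.
    """
    rng = np.random.default_rng(seed)
    # Poisson means N * i^(-alpha) for ranks i = 1..n
    rates = N * np.arange(1, n + 1, dtype=float) ** (-alpha)
    # One row per simulated corpus, one column per rank
    counts = rng.poisson(rates, size=(trials, n))
    # A row is correctly ordered if every count strictly exceeds the next one
    ordered = np.all(counts[:, :-1] > counts[:, 1:], axis=1)
    return float(ordered.mean())


def rate_threshold(N, alpha, A=0.75):
    """Evaluate (A * N / log N)**(1 / (alpha + 2)).

    A = 0.75 is an assumption: the abstract only states A(alpha) >= 3/4.
    """
    return float((A * N / np.log(N)) ** (1.0 / (alpha + 2.0)))


if __name__ == "__main__":
    N, alpha = 1e6, 1.5  # illustrative values, not taken from the paper
    print("threshold with A = 3/4:", rate_threshold(N, alpha))
    for n in (5, 10, 20, 40):
        print(n, prob_correct_order(N, alpha, n))
```

Running the script prints the estimated probability for a few values of `n`, which can be compared informally against the threshold. The results are finite-sample estimates at a fixed `N`, so they illustrate the asymptotic statement rather than verify it.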

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling