
TL;DR
This paper develops a statistical framework for estimating the total number of types in a corpus from the number of types observed in a sample of tokens. It derives exact and asymptotic distributions for that count and numerically validates a new estimator built on them.
Contribution
It introduces a novel estimator for the total number of types in a corpus, derived from the asymptotic distribution of the number of observed types conditioned on the number of sampled tokens and the latent type distribution.
Findings
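Derived exact and asymptotic distributions for the number of observed types, conditioned on the number of tokens and the latent type distribution
Used the asymptotic distributions to derive an estimator of the latent number of types
Validated the estimator through numerical experiments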
Abstract
We consider the problem of estimating the number of types in a corpus using the number of types observed in a sample of tokens from that corpus. We derive exact and asymptotic distributions for the number of observed types, conditioned upon the number of tokens and the latent type distribution. We use the asymptotic distributions to derive an estimator of the latent number of types and we validate this estimator numerically.
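To make the setting concrete, here is a minimal sketch of the quantities involved. It is not the paper's estimator. It assumes tokens are drawn i.i.d. with replacement from a fixed type distribution, in which case the expected number of observed types is the standard sum over types of 1 - (1 - p_i)^n. The inversion step further assumes a uniform type distribution, purely for illustration. All function names are hypothetical.

```python
import math
import random


def expected_observed_types(probs, n_tokens):
    """Expected number of distinct types in n_tokens i.i.d. draws.

    Assumes sampling with replacement from the type distribution `probs`.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def simulate_observed_types(probs, n_tokens, rng):
    """Draw n_tokens tokens and count how many distinct types appear."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_tokens)
    return len(set(draws))


def expected_uniform(n_types, n_tokens):
    """Expected observed types when all n_types are equally likely."""
    if n_types <= 1.0:
        return 1.0
    # N * (1 - (1 - 1/N)^n), written with log1p/expm1 for numerical stability.
    return -n_types * math.expm1(n_tokens * math.log1p(-1.0 / n_types))


def estimate_types_uniform(k_observed, n_tokens):
    """Illustrative method-of-moments estimate under the uniform assumption.

    Solves expected_uniform(N, n_tokens) = k_observed for N by bisection.
    This is NOT the estimator proposed in the paper.
    """
    if k_observed >= n_tokens:
        # Every token was a new type: no finite solution exists.
        return math.inf
    lo, hi = float(k_observed), 2.0 * k_observed
    while expected_uniform(hi, n_tokens) < k_observed:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_uniform(mid, n_tokens) < k_observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(0)
    true_types, n_tokens = 1000, 500
    probs = [1.0 / true_types] * true_types
    k = simulate_observed_types(probs, n_tokens, rng)
    print("observed types:", k)
    print("expected types:", expected_observed_types(probs, n_tokens))
    print("estimated total types:", estimate_types_uniform(k, n_tokens))
```

The uniform assumption keeps the inversion one-dimensional. On a skewed type distribution, this simple inversion would typically underestimate the number of types, which is part of why the distribution of the observed count matters.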
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Algorithms and Data Compression
