General Type Token Distribution

Shohei Hidaka

arXiv:1305.0328 · stat.ME · June 27, 2014

TL;DR

This paper develops a statistical framework for estimating the number of types in a corpus from the number of types observed in a sample of tokens. It derives exact and asymptotic distributions for the observed type count and numerically validates an estimator built on the asymptotic results.

Contribution

It introduces an estimator of the latent number of types in a corpus, derived from the asymptotic distribution of the number of observed types conditioned on the number of tokens and the latent type distribution.

Findings

01

Derived exact and asymptotic distributions for the number of observed types

02

Validated the estimator through numerical experiments

03

Provided theoretical insights into type distribution estimation

Abstract

We consider the problem of estimating the number of types in a corpus using the number of types observed in a sample of tokens from that corpus. We derive exact and asymptotic distributions for the number of observed types, conditioned upon the number of tokens and the latent type distribution. We use the asymptotic distributions to derive an estimator of the latent number of types and we validate this estimator numerically.
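
To make the setup concrete, here is a minimal sketch in Python. It is not the paper's estimator. It relies only on a standard fact: when N tokens are drawn independently from a type distribution p, the expected number of distinct observed types is the sum over types of 1 - (1 - p_i)^N. The uniform type distribution, the moment-matching inversion, and all function names below are illustrative assumptions introduced here, not taken from the paper.

```python
import math


def expected_types(probs, n_tokens):
    """Expected number of distinct types seen in n_tokens i.i.d. draws.

    Uses E[K] = sum_i (1 - (1 - p_i)^N), which holds for any type
    distribution `probs` under sampling with replacement.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def expected_types_uniform(m, n_tokens):
    """Same expectation for m equiprobable types (m may be non-integer)."""
    if m <= 1.0:
        return 1.0 if n_tokens > 0 else 0.0
    # m * (1 - (1 - 1/m)^N), written with log1p/expm1 for numerical stability.
    return m * -math.expm1(n_tokens * math.log1p(-1.0 / m))


def estimate_num_types_uniform(k_observed, n_tokens, tol=1e-9):
    """Illustrative moment-matching estimate of the latent number of types.

    ASSUMPTION: types are equiprobable. Solves E[K | M, N] = k_observed
    for M by bisection. This is a textbook-style sketch, not the
    estimator derived in the paper.
    """
    if k_observed < 1:
        raise ValueError("k_observed must be at least 1")
    if k_observed >= n_tokens:
        # Every token was a new type: the sample gives no finite upper bound.
        return math.inf
    lo, hi = float(k_observed), 2.0 * k_observed
    # The expectation increases with M toward N, so doubling terminates.
    while expected_types_uniform(hi, n_tokens) < k_observed:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_types_uniform(mid, n_tokens) < k_observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `estimate_num_types_uniform(40, 50)` asks how many equiprobable types would be expected to yield 40 distinct types in 50 tokens. Real corpora are far from uniform, which is one reason an estimator conditioned on the latent type distribution, as the abstract describes, is of interest.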

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Algorithms and Data Compression