Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance
Fermin Moscoso del Prado Martin

TL;DR
The paper introduces the derivational entropy rate, a measure linking grammatical diversity to utterance length, and presents the SITE tool for estimating entropy measures accurately from small corpora, with implications for NLP and language processing.
Contribution
It establishes a theoretical and empirical connection between derivational entropy and mean length of utterances, introducing the derivational entropy rate and the SITE tool for small corpus analysis.
Findings
The derivational entropy rate effectively measures grammatical complexity.
MLU is a fundamental index of syntactic diversity.
SITE accurately estimates entropy measures from small treebanks.
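To make the quantities named above concrete, here is a minimal Python sketch. It assumes, as an illustration only, that the entropy over observed syntactic structures can be approximated with a plain plug-in (maximum-likelihood) estimate and that an entropy rate can be read as that entropy divided by the MLU. The function names and the toy bracketed structures are hypothetical. This is not the SITE tool, and the plug-in estimate is known to underestimate entropy on small samples, which is the difficulty a dedicated small-corpus estimator is meant to address.

```python
from collections import Counter
from math import log2


def mean_length_of_utterances(utterances):
    """Average number of tokens per utterance (MLU, counted in words here)."""
    return sum(len(u) for u in utterances) / len(utterances)


def plugin_entropy(items):
    """Plug-in (maximum-likelihood) Shannon entropy in bits.

    Assumption: relative frequencies stand in for true probabilities.
    This estimate is biased downward when the sample is small.
    """
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def naive_entropy_rate(utterances, structures):
    """Illustrative ratio: structure entropy (bits) per word of utterance."""
    return plugin_entropy(structures) / mean_length_of_utterances(utterances)


# Toy treebank: tokenized utterances paired with hypothetical bracketed structures.
utterances = [
    ["dogs", "bark"],
    ["cats", "sleep"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
]
structures = [
    "(S N V)",
    "(S N V)",
    "(S (NP D N) V)",
    "(S (NP D N) V)",
]

mlu = mean_length_of_utterances(utterances)
entropy_bits = plugin_entropy(structures)
rate = naive_entropy_rate(utterances, structures)
print(mlu, entropy_bits, rate)
```

In this toy sample two structures occur equally often, so the plug-in entropy is 1 bit, and with an MLU of 2.5 words the ratio is 0.4 bits per word. With a realistic small treebank most structures occur once or never, so the plug-in figure would be too low.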
Abstract
In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate, theoretically and empirically, that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
