Measuring Lexical Diversity in Texts: The Twofold Length Problem
Yves Bestgen

TL;DR
This paper critically reviews lexical diversity indices and the long-standing problem of their dependence on text length. Indices that reduce all texts to the same length solve that dependency, but they remain sensitive to the parameter that sets the reduced length.
Contribution
It provides a comprehensive analysis of existing indices, identifies their limitations regarding length dependency and parameter sensitivity, and offers recommendations for improved lexical diversity measurement.
Findings
Indices that reduce all texts to the same length, using a probabilistic or an algorithmic approach, solve the length dependency problem.
All of these indices remain sensitive to the parameter that determines the length to which texts are reduced.
The paper offers practical recommendations for optimizing lexical diversity analysis.
Abstract
The impact of text length on the estimation of lexical diversity has captured the attention of the scientific community for more than a century. Numerous indices have been proposed, and many studies have been conducted to evaluate them, but the problem remains. This methodological review provides a critical analysis not only of the most commonly used indices in language learning studies, but also of the length problem itself, as well as of the methodology for evaluating the proposed solutions. The analysis of three datasets of English language-learners' texts revealed that indices that reduce all texts to the same length using a probabilistic or an algorithmic approach solve the length dependency problem; however, all these indices failed to address the second problem, which is their sensitivity to the parameter that determines the length to which the texts are reduced. The paper…
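To make the two problems in the abstract concrete, here is a minimal Python sketch. It is not the paper's code, and the function names (`ttr`, `expected_ttr_at_length`, `mattr`) are illustrative assumptions. It contrasts a raw type-token ratio with one possible probabilistic reduction (the expected type-token ratio of a random sample of n tokens, computed from the hypergeometric distribution) and one possible algorithmic reduction (a moving-average type-token ratio over a fixed window). Whether these match the exact indices evaluated in the paper is not stated in the abstract.

```python
from collections import Counter
from math import comb


def ttr(tokens):
    """Raw type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def expected_ttr_at_length(tokens, n):
    """Expected TTR of a random sample of n tokens drawn without replacement.

    Assumption: this is one possible 'probabilistic' way of reducing a
    text to length n; n is the parameter the abstract refers to.
    """
    total = len(tokens)
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the text length")
    denom = comb(total, n)
    # A type with frequency f is absent from the sample with probability
    # C(total - f, n) / C(total, n); comb() returns 0 when n > total - f.
    expected_types = sum(
        1 - comb(total - f, n) / denom for f in Counter(tokens).values()
    )
    return expected_types / n


def mattr(tokens, window):
    """Moving-average TTR: mean TTR over every window of `window` tokens.

    Assumption: this is one possible 'algorithmic' way of reducing a
    text to a fixed length; `window` plays the role of the parameter.
    """
    total = len(tokens)
    if not 0 < window <= total:
        raise ValueError("window must be between 1 and the text length")
    scores = [
        len(set(tokens[i:i + window])) / window
        for i in range(total - window + 1)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    text = "a b a c".split()
    print(ttr(text), ttr(text * 2))      # raw TTR drops as the text grows
    for n in (2, 3, 4):                  # values shift with the parameter
        print(n, expected_ttr_at_length(text, n), mattr(text, n))
```

Repeating a text halves its raw type-token ratio even though its vocabulary is unchanged, which is the length problem. Fixing n or the window removes that effect, but the loop shows that the resulting values depend on the parameter chosen, which is the second problem the abstract describes.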
Taxonomy
Topics: Text Readability and Simplification · Second Language Acquisition and Learning
