A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations
Vijeta Deshpande, Ishita Dasgupta, Uttaran Bhattacharya, Somdeb Sarkhel, Saayan Mitra, Anna Rumshisky

TL;DR
This paper introduces PATTR, a new lexical diversity metric for synthetic texts that accounts for length variations, improving the accuracy of diversity measurement in large language model outputs.
Contribution
The paper proposes PATTR, a length-adjusted diversity metric, and demonstrates its effectiveness over existing metrics in evaluating synthetic text diversity.
Findings
PATTR outperforms Moving-Average TTR (MATTR) and Compression Ratio (CR) in diversity assessment.
Existing metrics suffer from length bias, which PATTR mitigates.
PATTR is effective for selecting highly diverse responses that also adhere closely to the intended length.
Abstract
Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR).…
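To make the baseline metrics named in the abstract concrete, here is a minimal sketch of plain TTR, MATTR, and CR. It is not the authors' implementation: whitespace tokenization, the window size of 50, the use of `zlib` as the compressor, and the convention that CR is original size divided by compressed size are all assumptions made for illustration. PATTR itself is not sketched because its formula does not appear in this summary.

```python
import zlib


def ttr(tokens):
    """Type-token ratio: number of unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every sliding window of fixed size.

    The window size is an illustrative assumption, not a value from the paper.
    """
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


def compression_ratio(text):
    """Raw byte size divided by zlib-compressed size (assumed convention).

    Under this convention a higher ratio means more redundancy,
    i.e. lower diversity.
    """
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


# Assumed whitespace tokenization; the same sentence repeated ten times.
short = "the cat sat on the mat".split()
long_ = short * 10

print(ttr(short), ttr(long_))          # plain TTR drops as the text grows
print(mattr(long_, window=6))          # windowed TTR varies far less with length
print(compression_ratio(" ".join(long_)))
```

Plain TTR falls sharply on the longer input even though the vocabulary is unchanged, which illustrates the kind of length sensitivity the abstract is concerned with.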
