Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

Brando Miranda; Alycia Lee; Sudharsan Sundar; Allison Casasola; Rylan Schaeffer; Elyas Obbad; Sanmi Koyejo

arXiv:2306.13840·cs.CL·July 4, 2025·6 cites

Open Access 3 Reviews

TL;DR

This paper introduces the diversity coefficient, a formal measure of data variability, and demonstrates its relevance for assessing data quality and improving language model performance.

Contribution

It formalizes the diversity coefficient as a data quality metric and shows its correlation with model evaluation performance across multiple models.

Findings

01. Diversity coefficient increases with latent concept count.

02. Pre-training datasets have high formal diversity.

03. Diversity coefficient correlates with downstream performance.

Abstract

Current trends in pre-training Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. While the quality of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality -- measuring the variability of natural language data -- specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a…
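
The following is a minimal sketch of the idea, not the authors' implementation. It assumes, as the peer reviews on this page describe, that the diversity coefficient is based on the cosine distance between Task2Vec embeddings of data batches, and it averages that distance over all distinct pairs. The function name `diversity_coefficient` and the toy vectors are hypothetical; real Task2Vec embeddings would come from a probe network, which is not reproduced here.

```python
import numpy as np


def diversity_coefficient(embeddings):
    """Mean pairwise cosine distance between batch embeddings.

    `embeddings` is an (n_batches, dim) array with one row per batch.
    In the paper these rows would be Task2Vec embeddings; here they are
    arbitrary non-zero vectors (an assumption made for illustration).
    """
    emb = np.asarray(embeddings, dtype=float)
    n = emb.shape[0]
    if n < 2:
        raise ValueError("need at least two batch embeddings")

    # Normalise each row so that dot products equal cosine similarities.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T

    # Average the cosine distance over distinct pairs (i < j) only.
    upper = np.triu_indices(n, k=1)
    return float(np.mean(1.0 - sims[upper]))


if __name__ == "__main__":
    identical = np.ones((4, 3))   # every batch looks the same
    orthogonal = np.eye(3)        # batches share no direction
    print(diversity_coefficient(identical))   # close to 0
    print(diversity_coefficient(orthogonal))  # close to 1
```

The two toy inputs mark the extremes: identical batch embeddings give a value near 0, and mutually orthogonal ones give a value near 1.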

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

This paper tackles an important problem, potentially deepening our understanding of the effects of pretraining data quality particularly with respect to diversity. The approach combines concepts from vision (task2vec) with language modeling to create an original measure for textual data diversity.

Weaknesses

1. The use of cosine distance between task2vec embeddings as a diversity measure is not well motivated. Can we consider other textual embedding methods? What are the tradeoffs?
2. The absolute diversity coefficient value does not convey much information about what it means. I understand the authors' argument for conceptual lower and upper bounds, but these do not capture any representation of real natural language. The work should present a better way to understand the value of the measure of di

Reviewer 02 · Rating 5 · Confidence 2

Strengths

A tool to measure pre-training data diversity.

Weaknesses

- The authors evaluate performance against pre-training data diversity (Section 3.1), but the performance metric is the language-modeling cross-entropy loss. I wonder whether evaluation on specific NLP tasks would make sense here (many NLP benchmarks are used to evaluate LLMs on tasks such as question answering, GLUE, etc.).
- The Vendi Score seems to be another approach to computing diversity. Why was it not included as a baseline in the main experiments (at least in the main Table 1)?
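
For context on the baseline this reviewer mentions, here is a rough sketch of the Vendi Score as it is commonly defined: the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n positive semidefinite similarity matrix with ones on the diagonal. The function name `vendi_score` and the toy kernels are assumptions for illustration; nothing here reproduces the paper's experiments.

```python
import numpy as np


def vendi_score(similarity):
    """Vendi Score of a symmetric PSD similarity matrix with unit diagonal.

    Computed as exp(-sum_i lam_i * log(lam_i)), where lam_i are the
    eigenvalues of similarity / n. Zero eigenvalues contribute nothing.
    """
    K = np.asarray(similarity, dtype=float)
    n = K.shape[0]

    # eigvalsh is appropriate because K is assumed symmetric.
    eigvals = np.linalg.eigvalsh(K / n)

    # Drop numerically zero (or slightly negative) eigenvalues: 0 * log 0 := 0.
    eigvals = eigvals[eigvals > 1e-12]

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


if __name__ == "__main__":
    print(vendi_score(np.eye(4)))        # four fully distinct items
    print(vendi_score(np.ones((4, 4))))  # four identical items
```

The score can be read as an effective number of distinct items: close to n when all items are dissimilar and close to 1 when they are all the same.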

Reviewer 03 · Rating 3 · Confidence 3

Strengths

This work focuses on an important and potentially very impactful area of research. While the general belief of the community is that data diversity is an important factor for the performance of language models, the relationship between pre-training data diversity and downstream performance is currently poorly understood. The use of Task2Vec embeddings to measure data diversity is to the best of my knowledge novel. The diversity results presented in Table 1 align well with current intuitions in t

Weaknesses

> Claim: “PRE-TRAINING IN HIGHER DIVERSITY LEADS TO BETTER EVALUATION PERFORMANCE” (Section 3.1)

The experiments in Section 3.1, which aim to relate the proposed measure of data diversity to performance and which in my view would amount to the most substantial contribution of the paper, are highly unconvincing. Only three datasets are considered: PubMed, USPTO, and PubMed+USPTO. Linear regressions on three (!) datapoints are presented as evidence to substantiate the authors' claim. Clearly, the

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
