Measuring Diversity in Synthetic Datasets

Yuchang Zhu; Huizhe Zhang; Bingzhe Wu; Jintang Li; Zibin Zheng; Peilin Zhao; Liang Chen; Yatao Bian

arXiv:2502.08512 · cs.CL · August 15, 2025

TL;DR

This paper introduces DCScore, a classification-based method for measuring the diversity of synthetic NLP datasets. It correlates more strongly with diversity pseudo-truths than existing approaches and costs less to compute.

Contribution

The paper presents DCScore, a diversity measurement method that is theoretically grounded (it satisfies a set of diversity-related axioms), correlates well with diversity pseudo-truths, and reduces computational cost relative to existing methods.

Findings

01. DCScore correlates strongly with diversity pseudo-truths.

02. It reduces computational costs compared to existing methods.

03. Theoretical verification confirms its validity as a diversity measure.

Abstract

Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of…
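The abstract does not spell out the formula, so the following is only a minimal sketch of one way to read "diversity evaluation as a sample classification task". It is not the authors' implementation. It assumes each sample is already represented by a non-zero embedding vector, that similarity is cosine similarity, and that each sample is "classified" against all samples with a row-wise softmax. The score is the total probability mass that samples assign to themselves. The function name `classification_diversity` and the `temperature` parameter are hypothetical.

```python
import math


def classification_diversity(embeddings, temperature=1.0):
    """Hypothetical sketch: sum over samples of P(sample i is classified as itself).

    Assumes every embedding is a non-zero vector of the same length.
    """

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    score = 0.0
    for i, sample in enumerate(embeddings):
        # Similarity of sample i to every sample, including itself.
        logits = [cosine(sample, other) / temperature for other in embeddings]
        # Numerically stable softmax over the row.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        # Probability that sample i is assigned to its own "class".
        score += weights[i] / sum(weights)
    return score


duplicates = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
distinct = [[1.0, 0.0], [0.0, 1.0]]
print(classification_diversity(duplicates))
print(classification_diversity(distinct, temperature=0.01))
```

Under these assumptions the score ranges from about 1 (all samples indistinguishable, so each one gets only 1/n of its own probability mass) up to n (every sample is confidently classified as itself), which matches the intuition of counting effectively distinct samples.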

Code & Models

Repositories

bluewhalelab/dcscore (PyTorch, official)

Videos

Measuring Diversity in Synthetic Datasets · SlidesLive

Taxonomy

Topics: Qualitative Comparative Analysis Research