Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong

TL;DR
This paper introduces ODiS, a diversity-aware data selection method for large language models that decomposes correlated metrics into orthogonal dimensions, improving data quality and diversity, and leading to better downstream performance.
Contribution
The paper proposes the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which ensures diversity by decorrelating multiple data quality metrics and selecting top data within each orthogonal dimension.
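The decorrelate-then-select idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `odis_select`, the `top_k` parameter, and the use of plain SVD-based PCA on standardized scores are assumptions made for the example. The sign of each principal component is arbitrary under SVD, so a real pipeline would have to decide which direction counts as "top" for each dimension.

```python
import numpy as np

def odis_select(scores: np.ndarray, n_components: int = 4, top_k: int = 100):
    """Hypothetical sketch: scores is an (n_docs, n_metrics) matrix of quality scores."""
    # Standardize each metric so no single scale dominates the decomposition.
    std = scores.std(axis=0)
    std[std == 0] = 1.0
    z = (scores - scores.mean(axis=0)) / std

    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(z, full_matrices=False)

    # Project onto the leading components; the resulting columns are uncorrelated.
    pcs = z @ vt[:n_components].T

    # Take the top_k documents along each orthogonal dimension (sign is arbitrary here).
    per_dim = [np.argsort(-pcs[:, j])[:top_k] for j in range(n_components)]

    # The final selection is the union of the per-dimension picks.
    selected = np.unique(np.concatenate(per_dim))
    return selected, per_dim, pcs
```

Selecting per dimension and taking the union, rather than ranking by a single averaged score, is what keeps any one correlated cluster of metrics from dominating the selected set.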
Findings
ODiS achieves less than 2% overlap between dimensions.
Models trained with ODiS-selected data outperform baselines on benchmarks.
Orthogonal, diversity-aware selection improves downstream task performance.
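One way to check an overlap figure like the one above is to compare the sets of documents selected along each dimension. The sketch below is an assumption-laden illustration: this summary does not state how the paper defines overlap, so the helper `pairwise_overlap` (a hypothetical name) uses the shared fraction relative to the smaller of the two sets.

```python
from itertools import combinations

def pairwise_overlap(selections):
    """Hypothetical helper: shared fraction for each pair of selected index sets."""
    sets = [set(s) for s in selections]
    overlap = {}
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        # Normalize by the smaller set; this choice of denominator is an assumption.
        denom = min(len(a), len(b))
        overlap[(i, j)] = len(a & b) / denom if denom else 0.0
    return overlap
```

For example, `pairwise_overlap([[1, 2, 3, 4], [4, 5, 6, 7]])` reports that the two selections share one of four items.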
Abstract
High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. This non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal…
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is well written
- The introduction is clear
1. I think a fundamental limitation of data selection research is that the scale with which it is conducted is too small. The difference between the best 100B tokens and the average 100B tokens is massive, and thus data selection methods can make a huge difference. But once one is forced to go to 50T-100T tokens, there simply aren’t 50T high quality tokens and 50T low quality tokens to separate. This is the point made by https://openaccess.thecvf.com/content/CVPR2024/html/Goyal_Scaling_Laws_for_
The proposed ODiS approach is conceptually simple and effective, showing that removing correlation among quality metrics is a useful insight and appropriately implemented via PCA. The experimental results demonstrate consistent gains over widely used baselines. The method seems to be model-agnostic, scalable, and practical. The study highlights a non-monotonic relationship between data quality scores and downstream performance, providing a compelling explanation grounded in diversity.
The method relies heavily on GPT-based scoring to obtain 11-dimensional metrics. This is a substantial concern, as it introduces both bias and cost. How the thresholds per PC are chosen is underspecified. The interpretation of the principal components remains unclear, limiting insight into what semantic attributes each dimension captures. Experiments are limited to a single language and corpus (Chinese Nemotron-CC) and one model scale (1.5B), raising concerns about generality. M
The logic is clear, and most arguments are supported by convincing empirical evidence:
- There are correlations between existing data selection metrics → Figure 5(a)
- PCA can remove these correlations → Figure 6(b)
- Selecting data according to the PC dimension scores (the first four in practice) improves performance → Table 1
- In contrast, selecting according to only one PC dimension compromises performance → Figure 3(a)
There are some typos or mistakes in the figures/tables:
- (W1) In Figure 1, the Arc-Easy results are lower than those of the more challenging Arc-Challenge.
- (W2) In Table 1, the results are reported as averages over five domains. However, the average scores appear to be incorrect. This applies to the proposed method (ODiS) as well as to the baselines PC Average-Sample and PC Average-Top, while the averages for DSIR and Random Selection seem correct.
Additionally, (W3) the performance of the
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
