On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang

TL;DR
This paper introduces CRDS, a novel data selection framework for instruction tuning that reduces redundancy in semantic representations, leading to improved data quality and model performance with significantly less data.
Contribution
The paper proposes CRDS, a data selection method for instruction tuning that mitigates redundancy in the semantic embeddings produced by LLM encoders, outperforming existing methods while requiring less data.
Findings
CRDS-W achieves strong performance with only 3.5% of data.
Both CRDS variants outperform state-of-the-art methods.
CRDS improves data quality and model performance.
Abstract
Data quality is a crucial factor in large language model training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality.…
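The two compression ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-layer projection order, the 1/sqrt(k) scaling, the PCA-style whitening, and the toy dimensions are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np

def rademacher_project(hidden, out_dim, rng):
    """Project (n, d) features to (n, out_dim) using a random +/-1 matrix."""
    d = hidden.shape[1]
    signs = rng.choice([-1.0, 1.0], size=(d, out_dim))
    # Scaling by 1/sqrt(out_dim) preserves squared norms in expectation.
    return hidden @ signs / np.sqrt(out_dim)

def project_and_concat(layers, out_dim, seed=0):
    """Assumed CRDS-R-style step: project each layer's features, then concatenate."""
    rng = np.random.default_rng(seed)
    projected = [rademacher_project(h, out_dim, rng) for h in layers]
    return np.concatenate(projected, axis=1)

def whiten_reduce(x, k, eps=1e-8):
    """Assumed CRDS-W-style step: PCA whitening that keeps the top-k components."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    w = eigvecs[:, top] / np.sqrt(eigvals[top] + eps)
    return xc @ w                                # output has ~identity covariance

# Toy stand-in for hidden states: 3 layers, 200 examples, 16 dimensions each.
rng = np.random.default_rng(42)
layers = [rng.normal(size=(200, 16)) for _ in range(3)]

compressed = project_and_concat(layers, out_dim=4)   # shape (200, 12)
whitened = whiten_reduce(layers[-1], k=4)            # shape (200, 4)
```

In a real pipeline the toy arrays would be replaced by pooled hidden states from an LLM encoder, and the compressed vectors would feed whatever similarity-based selection rule is being used.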
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
