Rethinking Representativeness and Diversity in Dynamic Data Selection
Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia

TL;DR
This paper introduces a dynamic data selection method that improves training efficiency by prioritizing coverage of dataset-level frequent factors and enforcing diversity across the selection process, achieving over 2x training acceleration without sacrificing accuracy.
Contribution
It redefines representativeness and diversity in data selection, proposing a framework that balances frequent and rare factors during training without extra computational overhead.
Findings
Achieves over 2x training acceleration.
Matches or exceeds full-data accuracy.
Demonstrates effectiveness across vision and text benchmarks.
Abstract
Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics.…
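The representativeness idea in the abstract can be illustrated with a minimal sketch. It assumes the sparse autoencoder activations are already available as a samples-by-units matrix, and it scores each sample by summing the dataset-level activation frequencies of the units it activates. The function names, the activity threshold `eps`, and the scoring rule itself are illustrative assumptions; the abstract does not specify the paper's exact formula.

```python
# Hypothetical sketch: frequency-coverage scoring over sparse activations.
# The scoring rule here is an assumption, not the paper's stated formula.

def factor_frequencies(activations, eps=0.0):
    """Fraction of samples in which each sparse unit is active (> eps)."""
    n = len(activations)
    d = len(activations[0])
    return [sum(1 for row in activations if row[j] > eps) / n for j in range(d)]


def representativeness(row, freqs, eps=0.0):
    """Sum of dataset-level frequencies over the units this sample activates."""
    return sum(f for a, f in zip(row, freqs) if a > eps)


def select_top_k(activations, k):
    """Return indices of the k highest-scoring samples, plus all scores."""
    freqs = factor_frequencies(activations)
    scores = [representativeness(row, freqs) for row in activations]
    # Sort by descending score; break ties by original index for determinism.
    order = sorted(range(len(activations)), key=lambda i: (-scores[i], i))
    return order[:k], scores


# Toy activations: 4 samples, 4 sparse units (made-up values).
acts = [
    [0.9, 0.0, 0.0, 0.0],  # activates the most frequent unit only
    [0.7, 0.2, 0.0, 0.0],  # activates two frequent units
    [0.5, 0.3, 0.0, 0.0],  # activates two frequent units
    [0.0, 0.0, 0.0, 0.8],  # activates a rare unit only
]

selected, scores = select_top_k(acts, k=2)
print(selected, scores)
```

On this toy input, the samples that activate both frequent units rank highest, while the sample carrying only a rare unit ranks last. Under the paper's process-level view of diversity, such rare-factor samples would be brought in gradually as training proceeds rather than excluded.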
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis
