Rethinking Representativeness and Diversity in Dynamic Data Selection

Yuzhe Zhou; Zhenglin Hua; Haiyun Guo; Yuheng Jia

arXiv:2603.04981·cs.AI·March 6, 2026

Open Access

TL;DR

This paper introduces a dynamic data selection method that scores samples by dataset-level coverage and process-level diversity, reporting over 2x training acceleration while matching or exceeding full-data accuracy.

Contribution

It redefines representativeness and diversity in data selection, proposing a framework that balances frequent and rare factors during training without extra computational overhead.

Findings

01 Achieves over 2x training acceleration.

02 Matches or exceeds full-data accuracy.

03 Demonstrates effectiveness across vision and text benchmarks.

Abstract

Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics.…
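To make the coverage-based view of representativeness concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract is truncated and gives no formulas, so the function names, the activation threshold, the frequency-weighted score, and the linear `progress` blend toward rare factors are all assumptions chosen for illustration. The input `acts` stands in for sparse autoencoder unit activations, one row per sample.

```python
import numpy as np


def factor_frequencies(acts, thresh=0.0):
    """Fraction of samples in which each sparse unit is active.

    `acts` is a hypothetical (n_samples, n_units) activation matrix.
    """
    return (acts > thresh).mean(axis=0)


def selection_scores(acts, progress=0.0, thresh=0.0):
    """Score samples by the factors they activate.

    ASSUMPTION (not from the paper): each unit's weight is blended
    linearly from its frequency (progress=0, favors common factors)
    to one minus its frequency (progress=1, favors rare factors).
    """
    active = (acts > thresh).astype(float)
    freq = active.mean(axis=0)
    weights = (1.0 - progress) * freq + progress * (1.0 - freq)
    # Sum the weights of the units each sample activates.
    return active @ weights


def select_top_k(scores, k):
    """Indices of the k highest scores; ties go to the lower index."""
    return np.argsort(-scores, kind="stable")[:k]


# Toy data: 4 samples, 3 sparse units. Unit 0 is common, units 1-2 are rare.
acts = np.array(
    [
        [1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0],
    ]
)

early = select_top_k(selection_scores(acts, progress=0.0), k=2)
late = select_top_k(selection_scores(acts, progress=1.0), k=2)
```

In this toy example the early selection favors samples that activate the common unit, while the late selection moves toward samples that carry rare units. The schedule that drives `progress` over training is left open here because the source does not specify one.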

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis