Disentangling the Roles of Representation and Selection in Data Pruning

Yupei Du; Yingjin Song; Hugh Mee Wong; Daniil Ignatev; Albert Gatt; Dong Nguyen

arXiv:2507.03648 · cs.CL · July 8, 2025

TL;DR

This paper systematically analyzes data pruning in NLP. It finds that the data representation plays a crucial role: better representations, such as training gradients, generally lead to better instance selection regardless of the selection algorithm. Selection algorithms, in turn, excel in different settings and do not always align with their intended objectives.

Contribution

It decomposes data pruning into representation and selection components, providing theoretical and empirical insights into their roles and effectiveness.

Findings

01. Better representations improve instance selection.

02. Different algorithms excel in different scenarios.

03. Selection algorithms may not always align with their objectives.

Abstract

Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives:…
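
The two-component view described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names (`k_center_greedy`, `random_selection`, `prune`) are hypothetical, the representation matrix is random data standing in for whatever features one might use (for example embeddings or training gradients), and k-center greedy is chosen here only as one familiar diversity-oriented selection rule. The point is that the selection algorithm only sees a matrix of per-instance vectors, so the representation and the algorithm can be swapped independently.

```python
import numpy as np


def k_center_greedy(reps, budget, seed=0):
    """Greedily pick `budget` row indices, each time taking the instance
    farthest (in Euclidean distance) from everything already selected."""
    rng = np.random.default_rng(seed)
    n = reps.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting instance
    # Distance from every instance to its nearest selected instance.
    min_dist = np.linalg.norm(reps - reps[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(reps - reps[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


def random_selection(reps, budget, seed=0):
    """Baseline that ignores the representation and samples uniformly."""
    rng = np.random.default_rng(seed)
    return rng.choice(reps.shape[0], size=budget, replace=False).tolist()


def prune(reps, select, budget):
    """Data pruning as two pluggable parts: `reps` is the representation
    (one vector per training instance), `select` is the selection algorithm."""
    return select(reps, budget)


if __name__ == "__main__":
    # Stand-in representation: 50 instances, 8 features each (assumed data).
    reps = np.random.default_rng(1).normal(size=(50, 8))
    print(prune(reps, k_center_greedy, budget=10))
    print(prune(reps, random_selection, budget=10))
```

Replacing `reps` with a different feature matrix, while keeping `select` fixed, changes only the representation component; replacing `select` changes only the algorithm.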
