TL;DR
This paper systematically analyzes data pruning in NLP. It finds that the data representation (for example, training gradients) largely determines how good the selected instances are, that selection algorithms have different strengths, and that these algorithms do not always achieve their intended objectives.
Contribution
It decomposes data pruning into representation and selection components, providing theoretical and empirical insights into their roles and effectiveness.
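The two-component view can be illustrated with a minimal sketch, in which a pruning pipeline is a representation function followed by a selection function over the resulting vectors. Everything named here is an assumption made for illustration and is not taken from the paper: the function names `prune`, `k_center_greedy`, and `toy_represent` are hypothetical, greedy k-center is just one common diversity-oriented selection rule, and the toy representation stands in for richer choices such as training gradients.

```python
import numpy as np


def k_center_greedy(reps, k, seed=0):
    """Illustrative selection algorithm: greedy k-center over representations.

    Starts from a random instance, then repeatedly adds the instance that is
    farthest (Euclidean distance) from the already selected set.
    """
    rng = np.random.default_rng(seed)
    n = reps.shape[0]
    chosen = [int(rng.integers(n))]
    # Distance from every instance to its nearest selected instance.
    min_dist = np.linalg.norm(reps - reps[chosen[0]], axis=1)
    while len(chosen) < min(k, n):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(reps - reps[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def prune(dataset, represent, select, k):
    """Data pruning as two swappable parts: representation, then selection."""
    reps = np.stack([represent(x) for x in dataset])
    keep = select(reps, k)
    return [dataset[i] for i in keep]


def toy_represent(text):
    """Hypothetical stand-in representation: (length, number of spaces)."""
    return np.array([len(text), text.count(" ")], dtype=float)


data = ["a", "bb", "ccc ccc", "dddd dddd dddd", "e e e e e"]
subset = prune(data, toy_represent, k_center_greedy, k=3)
print(subset)
```

Because `represent` and `select` are passed in independently, the same selection rule can be run over different representations (or the reverse), which is the kind of controlled comparison the decomposition makes possible.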
Findings
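Better representations, such as training gradients, generally lead to better instance selection, regardless of the selection algorithm.
Different selection algorithms excel in different settings, and none consistently outperforms the others.
Selection algorithms do not always align with their intended objectives.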
Abstract
Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives:…