Compute-Constrained Data Selection
Junjie Oscar Yin, Alexander M. Rush

TL;DR
This paper investigates compute-constrained data selection for fine-tuning large language models, revealing that cheaper methods often outperform more expensive ones when considering total compute costs.
Contribution
It formalizes a cost-aware data selection framework and demonstrates that many powerful selection methods are not compute-optimal under budget constraints.
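One way to write the budgeted problem down, as a sketch rather than the paper's exact notation. The symbols here are assumed for illustration: \mathcal{D} is the candidate training pool, S the selected subset, P(S) the downstream performance after finetuning on S, C_sel the compute spent running the selection method, C_train(S) the compute spent training on S, and K the total compute budget.

```latex
\[
S^{*} \;=\; \arg\max_{S \subseteq \mathcal{D}} \; P(S)
\quad \text{subject to} \quad
C_{\text{sel}} + C_{\text{train}}(S) \;\le\; K
\]
```

Under this reading, a selection method is worth its cost only if the improvement in P(S) exceeds what the same compute would have bought as additional training tokens.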
Findings
Cheaper data selection methods often outperform expensive ones once total compute cost (selection plus training) is accounted for.
Perplexity-based selection becomes compute-optimal only when the training model is about 5x larger than the selection model.
Gradient-based selection becomes compute-optimal only when the training model is about 10x larger than the selection model.
Abstract
Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we…
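The trade-off between selection cost and training gain can be made concrete with rough FLOP accounting. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the constants (about 2 FLOPs per parameter per token for a forward pass, about 6 for forward plus backward) are common rules of thumb, and treating lexicon-based selection as free is a simplifying assumption.

```python
# Rough compute accounting for budgeted data selection.
# All constants and method costs are assumptions for illustration.


def forward_flops(n_params: float, n_tokens: float) -> float:
    """Forward pass only: roughly 2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens


def training_flops(n_params: float, n_tokens: float) -> float:
    """Forward plus backward pass: roughly 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def selection_flops(method: str, selector_params: float, pool_tokens: float) -> float:
    """Assumed cost of scoring the whole candidate pool once."""
    if method == "lexicon":
        # Assumption: lexical scoring is negligible next to model passes.
        return 0.0
    if method == "perplexity":
        # Assumption: one forward pass of the selector over the pool.
        return forward_flops(selector_params, pool_tokens)
    if method == "gradient":
        # Assumption: one forward and backward pass of the selector over the pool.
        return training_flops(selector_params, pool_tokens)
    raise ValueError(f"unknown method: {method}")


def affordable_training_tokens(
    budget: float,
    method: str,
    selector_params: float,
    train_params: float,
    pool_tokens: float,
) -> float:
    """Training tokens left after the selection cost is paid from the budget."""
    remaining = budget - selection_flops(method, selector_params, pool_tokens)
    if remaining <= 0:
        return 0.0
    return remaining / (6.0 * train_params)


if __name__ == "__main__":
    for name in ("lexicon", "perplexity", "gradient"):
        tokens = affordable_training_tokens(
            budget=6e18,
            method=name,
            selector_params=1e9,
            train_params=1e9,
            pool_tokens=1e9,
        )
        print(f"{name}: {tokens:.3e} training tokens affordable")
```

With a selector the same size as the training model, the more expensive methods leave fewer tokens for training under a fixed budget; shrinking `selector_params` relative to `train_params` narrows that gap, which is the intuition behind the model-size ratios reported in the findings.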
Peer Reviews
Decision · ICLR 2025 Poster
I found this to be an important experimental contribution for practitioners and academics alike, and it is likely to be heavily cited in the future. While there will inevitably be some discussion of whether they compared to all the right and best methods, I think that's in the details: they compared good and sufficiently recent example methods from high-level strategies and showed significant enough differences that seem endemic to these different strategies.
The weaknesses I detail below should all be corrected, but they are all minor, none of them individually or in total would be a good reason to reject the paper. SECTION 3 PROBLEMS: At the beginning of Section 3: “Our goal is to find the optimal subset S ⊆ X” pretty sure you mean subset S ⊆ D there? I think you are implying that the train set is not necessarily IID with the validation set, but that the validation set is IID with the test set. All I see you say is that the validation set is “c
This paper addresses compute-efficient fine-tuning, which is an important task in training LLMs. Extensive simulations are conducted to provide empirical evidence and support the framework.
1. Although the authors claim that some simple methods such as Lexicon outperform complex ones such as Perplexity and Gradient, Figure 1 shows that the complex ones perform quite well, especially under medium and large budgets. It would be more important to study the tipping point where the performance gains plateau, that is, where further increases in computing resources yield diminishing returns. 2. It is not surprising to see the tradeoff between performance a
1. This paper considers an interesting problem, data selection under computational constraints, and has interesting observations that the initial cost cannot be neglected when considering the computational budget.
1. (Major) Lack of novelty: although this paper proposes a framework for analyzing the computational cost of each data selection method, it does not provide any new techniques based on this framework. Furthermore, the key observation is not very surprising: the computational cost includes an initial cost for evaluating the validation set, so perplexity-based or gradient-based selection is clearly not optimal under a limited compute budget. 2. (Major) Lack of soundness: a) the parametric model i
Taxonomy
Topics: Data Mining Algorithms and Applications · Fuzzy Logic and Control Systems · Fault Detection and Control Systems
