Compute-Constrained Data Selection

Junjie Oscar Yin; Alexander M. Rush

arXiv:2410.16208·cs.LG·April 9, 2025



TL;DR

This paper studies data selection for fine-tuning large language models under a fixed compute budget, and finds that cheaper selection methods often outperform more expensive ones once the cost of selection is counted alongside the cost of training.

Contribution

The paper formalizes data selection with a cost-aware utility function and shows that many powerful selection methods are almost never compute-optimal once selection cost counts against the budget.

Findings

01

Cheaper data selection methods often outperform expensive ones when total compute (selection plus training) is held fixed.

02

Perplexity-based selection becomes compute-optimal only when the model being trained is about 5x larger than the model used for selection.

03

Gradient-based selection becomes compute-optimal only when the model being trained is about 10x larger than the model used for selection.

Abstract

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we…
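The trade-off described in the abstract can be illustrated with a rough budget calculation. The sketch below is not the paper's cost model. It assumes the common approximations of about 2 FLOPs per parameter per token for a forward pass and about 6 for a forward plus backward pass, treats a cheap lexical method as free, and charges perplexity-based selection one forward pass and gradient-based selection one forward plus backward pass over the candidate pool. All function names and the numbers in the usage lines are hypothetical.

```python
def forward_flops(n_params: float, n_tokens: float) -> float:
    """Assumed cost of a forward pass: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens


def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumed cost of forward + backward: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def selection_flops(method: str, selector_params: float, pool_tokens: float) -> float:
    """Up-front cost of scoring the whole candidate pool (simplified assumption)."""
    if method == "cheap":
        # e.g. lexical matching; treated as negligible next to model passes
        return 0.0
    if method == "perplexity":
        # one forward pass of the selector model over the pool
        return forward_flops(selector_params, pool_tokens)
    if method == "gradient":
        # one forward + backward pass of the selector model over the pool
        return training_flops(selector_params, pool_tokens)
    raise ValueError(f"unknown method: {method}")


def trainable_tokens(budget_flops: float, method: str, selector_params: float,
                     train_params: float, pool_tokens: float) -> float:
    """Tokens the budget still pays for after selection has been charged."""
    remaining = budget_flops - selection_flops(method, selector_params, pool_tokens)
    return max(0.0, remaining) / (6.0 * train_params)


if __name__ == "__main__":
    budget = 1e20          # hypothetical total FLOP budget
    train_params = 7e9     # hypothetical size of the model being fine-tuned
    pool_tokens = 1e9      # hypothetical size of the candidate pool
    for method in ("cheap", "perplexity", "gradient"):
        same = trainable_tokens(budget, method, train_params, train_params, pool_tokens)
        small = trainable_tokens(budget, method, train_params / 10, train_params, pool_tokens)
        print(f"{method:>10}: same-size selector {same:.3e} tokens, "
              f"10x smaller selector {small:.3e} tokens")
```

Under these assumptions, every FLOP spent scoring the pool is a FLOP not spent on training, and shrinking the selector relative to the trained model reduces that overhead proportionally. This is the intuition behind the model size ratios listed in the findings, although the exact ratios come from the paper's experiments rather than from this toy calculation.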

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

I found this to be an important experimental contribution for practitioners and academics alike, and it is likely to be heavily cited in the future. While there will inevitably be some discussion of whether they compared to all the right and best methods, I think that's in the details: they compared good and sufficiently recent example methods from high-level strategies and showed significant enough differences that seem endemic to these different strategies.

Weaknesses

The weaknesses I detail below should all be corrected, but they are all minor, none of them individually or in total would be a good reason to reject the paper. SECTION 3 PROBLEMS: At the beginning of Section 3: “Our goal is to find the optimal subset S ⊆ X” pretty sure you mean subset S ⊆ D there? I think you are implying that the train set is not necessarily IID with the validation set, but that the validation set is IID with the test set. All I see you say is that the validation set is “c

Reviewer 02 · Rating 5 · Confidence 3

Strengths

This paper addresses compute-efficient fine-tuning, which is an important task in training LLMs. Extensive simulations are conducted to provide empirical evidence and support the framework.

Weaknesses

1. Although the authors claim that some simple methods such as Lexicon outperform complex ones such as Perplexity and Gradient, as shown in Figure 1, the complex ones perform quite well, especially under medium and large budgets. It would be more important to study the tipping point where the performance gains plateau. This is the place where further increases in computing resources yield diminishing returns. 2. It is not surprising to see the tradeoff between performance a

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. This paper considers an interesting problem, data selection under computational constraints, and has interesting observations that the initial cost cannot be neglected when considering the computational budget.

Weaknesses

1. (Major) Lack of novelty: although this paper proposes a framework for analyzing the computational cost of each data selection method, it does not provide any new techniques based on this framework. Furthermore, the key observation is not very surprising: the computational cost contains an initial cost when evaluating the validation set, thus perplexity-based or gradient-based selection is clearly not optimal under a limited compute budget. 2. (Major) Lack of soundness: a) the parametric model i



Taxonomy

Topics: Data Mining Algorithms and Applications · Fuzzy Logic and Control Systems · Fault Detection and Control Systems