Efficiently Estimating Data Efficiency for Language Model Fine-tuning
Gyung Hyun Je, Colin Raffel

TL;DR
This paper introduces a metric and method to predict the data efficiency of fine-tuning large language models on specific tasks, reducing annotation costs by accurately estimating the required number of training examples.
Contribution
The paper proposes a novel gradient cosine similarity-based metric to predict data efficiency, validated across diverse tasks with significant annotation cost savings.
Findings
Achieved 8.6% error in data efficiency prediction
Reduced unnecessary annotations by hundreds per task
Validated on 30 specialized tasks
Abstract
While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task's data efficiency--i.e., the number of fine-tuning examples needed to achieve a desired level of performance--is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task's data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task's data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. We…
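The abstract's core idea, comparing the gradients of low-confidence examples by cosine similarity, can be illustrated with a small sketch. This is not the paper's implementation: the excerpt does not give the exact definition, so the logistic-regression stand-in for the LLM, the `fraction` cutoff used to pick low-confidence examples, and the choice to average over all pairs are assumptions made purely for illustration, and every function name here is hypothetical.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_gradients(w, X, y):
    """Gradient of the logistic loss w.r.t. w, one row per example."""
    p = _sigmoid(X @ w)                  # predicted P(y = 1)
    return (p - y)[:, None] * X          # shape: (n_examples, n_features)

def confidence(w, X, y):
    """Probability the model assigns to each example's true label."""
    p = _sigmoid(X @ w)
    return np.where(y == 1, p, 1.0 - p)

def mean_pairwise_cosine(G):
    """Average cosine similarity over all distinct pairs of rows in G."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    U = G / np.clip(norms, 1e-12, None)  # unit-normalize, guarding against zero norm
    S = U @ U.T
    upper = np.triu_indices(len(G), k=1) # distinct pairs only, no diagonal
    return float(S[upper].mean())

def low_confidence_gradient_similarity(w, X, y, fraction=0.5):
    """Assumed recipe: keep the least-confident `fraction` of the labeled
    examples (at least two), then average their pairwise gradient cosines."""
    conf = confidence(w, X, y)
    k = max(2, int(np.ceil(fraction * len(y))))
    idx = np.argsort(conf)[:k]           # lowest confidence first
    G = per_example_gradients(w, X[idx], y[idx])
    return mean_pairwise_cosine(G)

# Toy usage: three examples whose gradients all point the same way.
w0 = np.zeros(2)
X_toy = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
y_toy = np.array([1.0, 1.0, 1.0])
score = low_confidence_gradient_similarity(w0, X_toy, y_toy, fraction=1.0)
```

In the toy usage all three gradients are parallel, so `score` is 1.0 up to floating-point error. Gradients that point in opposing directions pull the score toward -1. The sketch needs at least two labeled examples, since the average is taken over pairs.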
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper attempts to quantify task-specific data efficiency that currently relies on expensive trial and error and costly empirical tuning. Understanding the data efficiency can help wisely expend the data annotation budget per task. Moreover, this method requires no labeled validation data, unlike most domain adaptation or reweighting-based methods. 2. The experimental setup is well motivated, with thorough ablations and multi-model validation (Llama-3.1, Mistral, Qwen). The proposed CoS-L
1. The mapping from predicted AUC to performance curve relies on simplifying assumptions (for instance, human-level saturation in 5k examples per task) that may not hold for long-tailed or complex tasks (for instance, the discussion on MMLU in Section 6). 2. CoS-Low may conflate data noise / out-of-distribution samples with genuine difficulty due to the reliance on low-confidence samples. 3. Task interaction effects remain unaddressed - modern LLMs are used in a multi-task setting and this appr
+ The authors empirically demonstrate a strong correlation between their proposed CoS-Low metric and the actual data efficiency of a task, making it a reliable signal for prediction. + The study validates its core motivation by showing that across a diverse set of 30 tasks, fine-tuning consistently leads to significant performance improvements over the model's initial zero-shot capability, highlighting the practical need for such an estimation method.
- The analysis is capped at a 5000-example budget, so it is not clear how the method works beyond that point. - The authors do not evaluate models on cross-benchmarks, which makes it hard to measure the impact of training across domains. - The paper assumes that performance is a monotonically non-decreasing function of the data size. The authors acknowledge this is a simplification and state that in the rare cases where it wasn't true, they adjusted the data to enforce the assumption, which may not
1. The paper takes a bold and interesting stance by revisiting linearity as a desirable inductive bias in relational models. This goes against the prevailing trend of ever more nonlinear message-passing architectures. The argument is well-justified both intuitively and empirically: simpler linear propagation can yield better extrapolation under graph shifts. 2. The authors provide clear derivations showing that ReLiNet’s linear relational operator can be viewed as a constrained instance of a fir
1. Although ReLiNet excels on graph-structured data, its utility for non-graph relational reasoning (e.g., text, vision-language relational datasets) is not demonstrated. The claim that “ReLiNet generalizes to any structured relation learning task” feels overstated. 2. On the larger OGB datasets, the performance improvement over GAT and Graph Transformer baselines is modest (~0.5–1% absolute). While statistically significant, it may not be practically impactful without additional benefits like e
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
