A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets
Ryan Lagasse, Aidan Kierans, Avijit Ghosh, Shiri Dori-Hacohen

TL;DR
This paper proposes a new scaling law for fine-tuning large language models that considers data composition, such as dataset volume, to improve token efficiency under fixed compute budgets.
Contribution
It introduces a scaling law that explicitly accounts for data composition factors, enhancing understanding of token efficiency in resource-constrained LLM fine-tuning.
Findings
Data composition significantly impacts token efficiency.
Refined scaling laws improve fine-tuning strategies.
Experiments validate the importance of dataset volume.
Abstract
We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length, which we term "dataset volume", play a decisive role in model performance. Our formulation is tuned following established procedures. Experiments on the BRICC dataset (Salavati et al., 2024) and subsets of the MMLU dataset (Hendrycks et al., 2021), evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings.
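To make the abstract's distinction concrete, here is a minimal sketch of the two composition factors it names (number of examples and average token length) alongside the total token count. It is illustrative only and does not reproduce the paper's scaling law, whose functional form is not given in this listing. The `Composition` class, the `composition` helper, and the whitespace tokenizer are assumptions made for this example; a real fine-tuning setup would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Composition:
    """Hypothetical summary of a fine-tuning dataset's composition."""
    num_examples: int
    avg_tokens: float
    total_tokens: int


def composition(examples, tokenize=str.split):
    """Count examples, average tokens per example, and total tokens.

    `tokenize` defaults to whitespace splitting, which is an assumption
    for illustration and not the tokenizer used in the paper.
    """
    lengths = [len(tokenize(text)) for text in examples]
    total = sum(lengths)
    n = len(lengths)
    return Composition(n, total / n if n else 0.0, total)


# Two toy datasets with the same total token count but different composition.
many_short = ["a b c"] * 8                    # 8 examples, 3 tokens each
few_long = ["a b c d e f g h i j k l"] * 2    # 2 examples, 12 tokens each

short_stats = composition(many_short)
long_stats = composition(few_long)

print(short_stats)
print(long_stats)
```

Both toy datasets contain the same number of tokens, so a measure based only on total tokens treats them as equivalent. The abstract's claim is that the split between example count and average length still matters for fine-tuning performance.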
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
