Quantifying the Importance of Data Alignment in Downstream Model Performance
Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda, Elyas Obbad, Sanmi Koyejo

TL;DR
This paper demonstrates that data alignment between training and evaluation datasets significantly impacts large language model performance, often more than dataset size, especially in specialized tasks like Autoformalization.
Contribution
It introduces a quantitative measure of data alignment and shows its strong correlation with downstream performance, challenging the focus on data quantity in LLM training.
Findings
Higher data alignment correlates with lower model loss.
Data alignment impacts performance more than dataset size.
Measuring alignment between training and evaluation data can guide the choice of training data.
Abstract
Contrary to the conventional emphasis on dataset size, we explore the role of data alignment -- an often overlooked aspect of data quality -- in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments for two settings: 1. the impact of increased alignment coefficients between various pre-training (pt) datasets and evaluation datasets, and 2. the impact of increased alignment coefficients between domain-specific fine-tuning (ft) data and domain-specific evaluation data. The domain-specific task we explore is Autoformalization -- the machine translation task between natural language and code for formal…
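The summary above does not give the formula for the alignment coefficient, so the following is only a minimal sketch under a stated assumption: that the coefficient is one minus the average cosine distance between embeddings of batches drawn from the two datasets. The function names are hypothetical, and the inputs are toy vectors standing in for Task2Vec embeddings, which in practice are computed with a probe network and are not reproduced here.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def alignment_coefficient(embs_a, embs_b):
    """Assumed form: 1 - mean cosine distance over all cross-dataset batch pairs.

    embs_a, embs_b: lists of batch embeddings (toy stand-ins for Task2Vec
    embeddings), one list per dataset.
    """
    dists = [cosine_distance(a, b) for a, b in product(embs_a, embs_b)]
    return 1.0 - sum(dists) / len(dists)


# Toy usage: batches pointing in similar directions score higher than
# batches pointing in unrelated directions.
train_batches = [[1.0, 0.1], [0.9, 0.2]]
eval_close = [[1.0, 0.0]]
eval_far = [[0.0, 1.0]]
score_close = alignment_coefficient(train_batches, eval_close)
score_far = alignment_coefficient(train_batches, eval_far)
```

Under this assumed definition, identical embedding directions give a coefficient of 1 and orthogonal directions give 0, so a higher value indicates training data that more closely resembles the evaluation data.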
Taxonomy
Topics: Simulation Techniques and Applications
