Quantifying the Importance of Data Alignment in Downstream Model Performance

Krrish Chawla; Aryan Sahai; Mario DePavia; Sudharsan Sundar; Brando Miranda; Elyas Obbad; Sanmi Koyejo

arXiv:2501.08496 · cs.CL · July 4, 2025

TL;DR

This paper demonstrates that data alignment between training and evaluation datasets significantly impacts large language model performance, often more than dataset size, especially in specialized tasks like Autoformalization.

Contribution

It introduces a quantitative measure of data alignment and shows its strong correlation with downstream performance, challenging the focus on data quantity in LLM training.

Findings

01

Higher data alignment correlates with lower model loss.

02

Data alignment impacts performance more than dataset size.

03

Alignment measurement guides better training strategies.

Abstract

Contrary to the conventional emphasis on dataset size, we explore the role of data alignment -- an often overlooked aspect of data quality -- in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments in two settings: 1. the impact of increased alignment coefficients between various pre-training (pt) datasets and evaluation datasets, and 2. the impact of increased alignment coefficients between domain-specific fine-tuning (ft) data and domain-specific evaluation data. The domain-specific task we explore is Autoformalization -- the machine translation task between natural language and code for formal…
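The abstract names the alignment coefficient but does not give its formula. As a rough illustration only, the sketch below assumes the coefficient is one minus the average cosine distance between embeddings of batches drawn from the two datasets. The function names and the toy vectors are hypothetical; in practice each vector would be a Task2Vec embedding computed from a probe network, which this sketch does not attempt.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def alignment_coefficient(embs_a, embs_b):
    """Assumed definition: 1 - mean cosine distance over all cross-dataset pairs.

    embs_a, embs_b: lists of batch embeddings (hypothetical stand-ins for
    Task2Vec embeddings of batches sampled from each dataset).
    """
    distances = [
        1.0 - cosine_similarity(u, v) for u, v in product(embs_a, embs_b)
    ]
    return 1.0 - sum(distances) / len(distances)


# Toy, made-up embeddings purely for illustration (not results from the paper).
train_embs = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]]
aligned_eval_embs = [[1.0, 0.1, 0.0]]      # points in a similar direction
misaligned_eval_embs = [[0.0, 0.1, 1.0]]   # points in a very different direction

aligned_score = alignment_coefficient(train_embs, aligned_eval_embs)
misaligned_score = alignment_coefficient(train_embs, misaligned_eval_embs)
print(aligned_score, misaligned_score)
```

Under this assumed definition, a score near 1 means the two sets of batch embeddings point in nearly the same direction, and a score near 0 means they are close to orthogonal. The toy data is constructed so the first evaluation set scores higher than the second.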

Taxonomy

Topics: Simulation Techniques and Applications