Dynamic Gradient Alignment for Online Data Mixing
Simin Fan, David Grangier, Pierre Ablin

TL;DR
This paper introduces Dynamic Gradient Alignment (DGA), an online method for optimizing data mixtures to improve large language model performance on specific tasks with limited data, outperforming importance sampling.
Contribution
The paper presents DGA, a scalable online gradient alignment algorithm that dynamically estimates optimal data mixtures for task-specific LLM training without retraining.
Findings
DGA outperforms importance sampling in small pre-training set scenarios.
DGA effectively handles limited specialized data, avoiding overfitting.
DGA achieves competitive results with minimal overhead.
Abstract
The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM for a specific task with access to only a few examples. Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques. This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm. DGA dynamically estimates the pre-training data mixture on which the models' gradients align as well as possible with those of the model on the specific task. DGA is the first gradient alignment approach that incurs minimal overhead compared to standard pre-training and outputs a competitive model, eliminating the need for…
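The core idea in the abstract, reweighting pre-training domains so that their gradients align with the gradient on the specific task, can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the multiplicative (exponentiated-gradient) update, the step size, and the placement of the EMA smoothing on the weights are all assumptions made for illustration, and the gradients are fixed toy vectors rather than gradients of a real model.

```python
import numpy as np


def alignment_scores(domain_grads, task_grad):
    """Dot product between each domain's gradient and the task gradient.

    domain_grads: array of shape (k, d), one flattened gradient per domain.
    task_grad:    array of shape (d,), flattened gradient on the specific task.
    """
    return domain_grads @ task_grad


def update_mixture(weights, scores, step_size=0.5):
    """Assumed update rule: multiplicative reweighting, renormalized to sum to 1.

    Domains whose gradients align with the task gradient gain weight.
    """
    logits = np.log(weights) + step_size * scores
    logits -= logits.max()  # subtract the max for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()


def ema(previous, current, beta=0.9):
    """Exponential moving average, assumed here to smooth the domain weights."""
    return beta * previous + (1.0 - beta) * current


# Toy setup (hypothetical values): three domains, three parameters.
task_grad = np.array([1.0, 0.0, -1.0])
domain_grads = np.array([
    [1.0, 0.0, -1.0],   # aligned with the task gradient (score 2)
    [0.0, 1.0, 0.0],    # orthogonal to it (score 0)
    [-1.0, 0.0, 1.0],   # opposed to it (score -2)
])

weights = np.full(3, 1.0 / 3.0)   # start from the uniform mixture
smoothed = weights.copy()

# In real training the gradients would be recomputed as the model changes;
# here they stay fixed so the effect of the update is easy to see.
for _ in range(5):
    scores = alignment_scores(domain_grads, task_grad)
    weights = update_mixture(weights, scores)
    smoothed = ema(smoothed, weights)

print("raw weights:     ", weights)
print("smoothed weights:", smoothed)
```

In this toy run the aligned domain ends up with the largest weight and the opposed domain with the smallest, while the smoothed weights move away from uniform more slowly than the raw ones.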
Peer Reviews
Decision: Submitted to ICLR 2025
(1) The proposed DGA shows advantages over traditional importance sampling in two challenging scenarios: when the number of training samples in each domain is limited, rather than assuming an unrealistically unlimited supply of tokens, and when the domain granularity is so fine that traditional domain reweighting methods incur intractable computational overhead. (2) The paper proposes two major innovations in DGA, including incorporating an EMA term into domain reweighting to miti
(1) For empirical validation, the paper uses RedPajama-v2 as the pre-training data, which has 30T tokens after filtering and deduplication. This is a non-trivial amount of data. However, the paper pre-trains models with 125M, 350M, and 750M parameters. In comparison, some SOTA LLMs are trained on smaller or comparable amounts of data with much larger model sizes, for example, Llama 3 405B trained on 15.6T tokens, QWen2.5-8B trained on 18T tokens, OpenLLaMA 3B, 7B, and 13B trained on 1T
1. DGA effectively addresses the overfitting issue of importance sampling in scenarios with limited data. 2. By incorporating the EMA term, DGA effectively addresses the issue of training instability. 3. By combining with importance sampling, DGA resolves the issue of "scaling linearly with the number of domains k". Even when there are a large number of domains, it can still demonstrate its advantages.
1. DGA's reasoning accuracy on datasets like MMLU is not good, but the authors only emphasize that DGA has strong language modeling capabilities. Performance on downstream tasks should be judged by reasoning accuracy rather than by language modeling capability. 2. The authors did not compare DGA with other methods under the pre-training + fine-tuning or pre-training + ICL paradigms, but LLMs typically adapt to downstream tasks using these paradi
1. The paper introduces Dynamic Gradient Alignment (DGA), a novel online gradient alignment algorithm that dynamically adjusts training data mixtures. 2. The experiments are well-designed, covering key scenarios of small pretraining sets and insufficient specialized data, using the MMLU benchmark. 3. The results clearly demonstrate the advantages of DGA over importance sampling and uniform baselines, with detailed analysis of different model scales and domain granularities
1. The paper provides a brief description of the theoretical foundation of DGA, lacking in-depth mathematical derivation and theoretical analysis. There is no rigorous proof of convergence and stability for DGA, which raises concerns about the robustness and reliability of the proposed method. 2. The paper does not compare DGA with other advanced domain reweighting methods. This omission makes it difficult to assess the relative performance and advantages of DGA compared to existing methods. 3.
