An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
Qizhen Zhang, Ankush Garg, Jakob Foerster, Niladri Chatterji, Kshitiz Malik, Mike Lewis

TL;DR
This study systematically investigates how noisy data affects loss divergence during large language model pretraining. It shows that noise can induce divergence, and that the probability of divergence depends on noise type, amount of noise, and model size.
Contribution
It provides the first large-scale empirical analysis of noise-induced divergence in LLM pretraining, including diagnostics to distinguish divergence causes.
Findings
Noisy data causes training loss divergence in LLMs.
Divergence probability depends on noise type, amount, and model size.
Distinct activation patterns distinguish noise-induced divergence from divergence caused by a high learning rate.
Abstract
Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model…
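The noise-injection setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `inject_uniform_noise`, the choice to corrupt whole sequences rather than individual tokens, and the parameter names are all assumptions made for illustration. The summary above does not specify the exact procedure or the noise types studied.

```python
import random


def inject_uniform_noise(sequences, vocab_size, noise_fraction, seed=0):
    """Replace a fraction of token sequences with uniformly random token IDs.

    Hypothetical helper: sequence-level corruption is an assumption,
    not a detail confirmed by the paper summary.
    """
    rng = random.Random(seed)
    # Number of sequences to corrupt, rounded to the nearest integer.
    num_noisy = round(noise_fraction * len(sequences))
    # Pick distinct sequence indices to overwrite.
    noisy_rows = sorted(rng.sample(range(len(sequences)), num_noisy))
    # Copy the clean data so the input is left unmodified.
    noisy = [list(seq) for seq in sequences]
    for row in noisy_rows:
        # Each token is drawn uniformly from [0, vocab_size).
        noisy[row] = [rng.randrange(vocab_size) for _ in sequences[row]]
    return noisy, noisy_rows


# Toy "clean" dataset: 8 sequences of 16 identical token IDs.
clean = [[1] * 16 for _ in range(8)]
noisy, noisy_rows = inject_uniform_noise(clean, vocab_size=1000, noise_fraction=0.25)
print(len(noisy_rows), "of", len(clean), "sequences replaced with noise")
```

Sweeping `noise_fraction` and `vocab_size` in a setup like this is one way the amount of noise could be varied while keeping the rest of the dataset fixed.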
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
