An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Qizhen Zhang; Ankush Garg; Jakob Foerster; Niladri Chatterji; Kshitiz Malik; Mike Lewis

arXiv:2602.02400·cs.LG·February 3, 2026

TL;DR

This study systematically investigates how noisy data affects loss divergence during large language model pretraining. It finds that noise induces divergence, and that the probability of divergence depends on the noise type, the amount of noise, and the model size.

Contribution

It provides the first large-scale empirical analysis of noise-induced divergence in LLM pretraining, including diagnostics to distinguish divergence causes.

Findings

01 Noisy data causes training loss divergence in LLMs.

02 Divergence probability depends on noise type, amount, and model size.

03 Distinct activation patterns differentiate noise-induced divergence from high learning rate failures.

Abstract

Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model…
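The noise-injection setup mentioned in the abstract can be illustrated with a minimal sketch. The abstract as shown here does not specify the exact procedure, so everything below is an assumption made for illustration: the function name `inject_uniform_noise`, the choice to corrupt individual token positions, and the parameters `vocab_size` and `noise_fraction` are all hypothetical and may differ from what the paper does.

```python
import random


def inject_uniform_noise(token_ids, vocab_size, noise_fraction, seed=0):
    """Return a copy of token_ids with a fraction of positions replaced by
    token ids drawn uniformly at random from [0, vocab_size).

    Illustrative sketch only; not the paper's actual procedure.
    """
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    n_noisy = round(noise_fraction * len(token_ids))
    # Choose distinct positions to corrupt.
    noisy_positions = sorted(rng.sample(range(len(token_ids)), n_noisy))
    noisy = list(token_ids)
    for pos in noisy_positions:
        # A uniform draw may occasionally reproduce the original token.
        noisy[pos] = rng.randrange(vocab_size)
    return noisy, noisy_positions


clean = list(range(100))  # stand-in for a tokenized clean sequence
noisy, positions = inject_uniform_noise(clean, vocab_size=50_000, noise_fraction=0.1)
```

Varying `noise_fraction` here corresponds loosely to the "amount of noise" that the abstract names as one of the factors affecting divergence probability; the noise types studied in the paper are not reproduced in this sketch.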

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification