Do we really have to filter out random noise in pre-training data for language models?

Jinghan Ru; Yuxin Xie; Xianwei Zhuang; Yuguo Yin; Zhihui Guo; Zhiming Liu; Qianli Ren; Yuexian Zou

arXiv:2502.06604·cs.CL·May 19, 2025

TL;DR

This paper investigates the impact of random noise in web-scale pre-training data for language models, revealing that noise has less effect on training loss than expected but can degrade downstream performance, and proposes a new denoising method.

Contribution

It provides the first systematic analysis of random noise in pre-training data and introduces a novel Local Gradient Matching loss to improve downstream task robustness.

Findings

01. Random noise raises next-token prediction (NTP) loss by significantly less than its proportion in the data.

02. A theoretical analysis explains why noise has such a limited impact on training loss.

03. The proposed method improves downstream performance across multiple benchmarks.

Abstract

Web-scale pre-training datasets are the cornerstone of LLMs' success. However, text data curated from the Internet inevitably contains random noise caused by decoding errors or unregulated web content. In contrast to previous works that focus on low quality or synthetic data, our study provides the first systematic investigation of such random noise through a cohesive "What-Why-How" framework. Surprisingly, we observed that the resulting increase in the loss of next-token prediction (NTP) was significantly lower than the proportion of random noise even when the model was scaled up to 2.7B. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and can be applied to multimodal models. On the other hand, experiments show that the model's performance in downstream tasks is not based solely on the NTP loss, which means…
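To make "proportion of random noise" concrete, the sketch below mixes uniformly random token ids into a clean token stream until noise makes up a chosen fraction of the result. It is only an illustration under stated assumptions: the function name `mix_random_noise`, the uniform sampling over the vocabulary, and appending the noise at the end are all hypothetical choices, not the noise-generation procedure used in the paper, which this listing does not describe.

```python
import random


def mix_random_noise(clean_tokens, vocab_size, noise_ratio, seed=0):
    """Append uniformly random token ids to a clean token list.

    Hypothetical helper: noise_ratio is the fraction of the *mixed*
    corpus that should be noise, so 0.2 means 20% of the output tokens.
    Returns the mixed token list and the number of noise tokens added.
    """
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_clean = len(clean_tokens)
    # Solve n_noise / (n_clean + n_noise) = noise_ratio for n_noise.
    n_noise = round(n_clean * noise_ratio / (1.0 - noise_ratio))
    # Assumption: noise is sampled uniformly over the whole vocabulary.
    noise = [rng.randrange(vocab_size) for _ in range(n_noise)]
    return list(clean_tokens) + noise, n_noise


if __name__ == "__main__":
    clean = list(range(80))  # stand-in for tokenized clean text
    mixed, n_noise = mix_random_noise(clean, vocab_size=50257, noise_ratio=0.2)
    print(len(mixed), n_noise, n_noise / len(mixed))
```

With a corpus built this way, the abstract's observation can be read as a comparison between `noise_ratio` and the relative increase in NTP loss measured on clean text after training on the mixed data.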

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Focus