Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Xu Guo; Runyu Peng; Jian Tong; Yunhua Zhou; Haijun Lv; Zhihui Lu; Qipeng Guo

arXiv:2605.10129·cs.CL·May 12, 2026

TL;DR

This paper introduces a lightweight synthetic pre-pre-training (PPT) stage that makes language models more robust to noisy pre-training data and reduces the natural-text tokens needed to reach the baseline's final loss.

Contribution

The paper proposes a lightweight synthetic pre-pre-training (PPT) stage, run before standard pre-training, that improves language models' robustness to noisy data, and it supports the result with mechanistic analyses.

Findings

01

Synthetic PPT consistently improves robustness across corruption settings, with larger relative gains at higher noise levels.

02

For a 1B-parameter model, a 65M-token synthetic PPT stage reaches the baseline's final loss with up to 49% fewer natural-text PT tokens.

03

Models gradually downweight noisy tokens during pre-training.

Abstract

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress…
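To put the token figures in context, the arithmetic can be sketched as follows. The abstract does not state the baseline's natural-text token budget, so `baseline_natural_tokens` below is a purely hypothetical input, and the function name `net_token_savings` is illustrative rather than taken from the paper or its repository. The 49% figure is the reported upper bound ("up to"), so this shows the best case only.

```python
def net_token_savings(baseline_natural_tokens: int,
                      ppt_tokens: int = 65_000_000,
                      natural_reduction_pct: int = 49) -> dict:
    """Best-case token accounting for a run that adds a synthetic PPT stage.

    baseline_natural_tokens: natural-text PT tokens used by the baseline
        (hypothetical; the abstract does not report this number).
    ppt_tokens: synthetic PPT tokens (65M in the abstract).
    natural_reduction_pct: percent fewer natural-text PT tokens needed to
        reach the same final loss (the abstract reports "up to 49%").
    """
    if not 0 <= natural_reduction_pct <= 100:
        raise ValueError("natural_reduction_pct must be between 0 and 100")

    # Integer arithmetic avoids floating-point rounding on large counts.
    natural_with_ppt = baseline_natural_tokens * (100 - natural_reduction_pct) // 100
    total_with_ppt = natural_with_ppt + ppt_tokens
    return {
        "natural_with_ppt": natural_with_ppt,
        "total_with_ppt": total_with_ppt,
        "net_saved": baseline_natural_tokens - total_with_ppt,
    }


# Hypothetical 10B-token baseline, chosen only to make the arithmetic concrete.
result = net_token_savings(10_000_000_000)
print(result)
```

Under that assumed 10B-token baseline, the 65M synthetic tokens are small relative to the natural-text tokens saved, which is the sense in which the PPT stage is described as lightweight.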

Code & Models

Repositories

guox18/formal-language-prepretraining (GitHub)
