TL;DR
This paper introduces a synthetic pre-pre-training stage that enhances language model robustness to noisy data, reducing the need for natural text during pre-training.
Contribution
It proposes a lightweight synthetic pre-pre-training method that improves noise resistance in language models, with detailed mechanistic insights.
Findings
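Synthetic PPT consistently improves robustness across corruption settings, with larger relative gains at higher noise levels.
For a 1B-parameter model, a 65M-token synthetic PPT stage matches the baseline's final loss with up to 49% fewer natural-text PT tokens.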
Models gradually downweight noisy tokens during pre-training.
Abstract
Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress…
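To make the notion of a "noise level" concrete, here is a minimal sketch of one possible corruption scheme: each token is replaced by a uniformly random vocabulary id with a fixed probability. This is an assumption for illustration only. The abstract does not specify the paper's corruption settings, and the function name `corrupt_tokens`, the uniform-replacement rule, and the vocabulary size are all hypothetical.

```python
import random


def corrupt_tokens(tokens, noise_rate, vocab_size, seed=0):
    """Replace each token with a uniformly random id with probability noise_rate.

    Hypothetical corruption scheme for illustration; not the paper's method.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [
        rng.randrange(vocab_size) if rng.random() < noise_rate else tok
        for tok in tokens
    ]


# Toy "clean" sequence of token ids and an assumed vocabulary size.
clean = list(range(1000))
noisy = corrupt_tokens(clean, noise_rate=0.3, vocab_size=50_000)

# Fraction of positions that actually changed (close to, but not exactly, 0.3).
changed = sum(a != b for a, b in zip(clean, noisy))
print(f"{changed / len(clean):.1%} of tokens replaced")
```

Under this sketch, a higher `noise_rate` corresponds to a noisier PT corpus, which is the regime where the abstract reports the largest relative gains from PPT.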
