Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Thao Nguyen; Yang Li; Olga Golovneva; Luke Zettlemoyer; Sewoong Oh; Ludwig Schmidt; Xian Li

arXiv:2506.04689·cs.CL·September 16, 2025

TL;DR

This paper introduces REWIRE, a method to transform and recycle discarded web data into high-quality training material, significantly improving language model performance and data efficiency at multiple scales.

Contribution

REWIRE is a novel data augmentation technique that enriches low-quality web texts, enabling more effective pre-training data utilization and outperforming existing synthetic data methods.
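
As a rough illustration of the recycling idea described above, the sketch below splits a corpus with a quality threshold, sends the documents that fail it through a rewrite step instead of dropping them, and mixes the result with the retained raw text. Everything named here is a hypothetical placeholder: `toy_quality_score`, `toy_rewrite`, and the 0.8 threshold are assumptions chosen for illustration, not the paper's actual filter, rewriting model, or settings.

```python
from typing import Callable, List, Tuple


def toy_quality_score(doc: str) -> float:
    """Hypothetical quality proxy: share of letters and spaces in the text."""
    if not doc:
        return 0.0
    good = sum(1 for ch in doc if ch.isalpha() or ch == " ")
    return good / len(doc)


def toy_rewrite(doc: str) -> str:
    """Placeholder for a model-guided rewrite; here it only tidies whitespace."""
    return " ".join(doc.split())


def recycle_corpus(
    docs: List[str],
    score: Callable[[str], float],
    rewrite: Callable[[str], str],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split docs by a quality threshold and rewrite the ones that fail it."""
    kept: List[str] = []
    recycled: List[str] = []
    for doc in docs:
        if score(doc) >= threshold:
            kept.append(doc)  # passes the filter: keep the raw text
        else:
            recycled.append(rewrite(doc))  # would be discarded: rewrite instead
    return kept, recycled


if __name__ == "__main__":
    corpus = [
        "A clean sentence about data.",
        "$$$ 1234 %%%   buy   now",
    ]
    kept, recycled = recycle_corpus(corpus, toy_quality_score, toy_rewrite, 0.8)
    training_mix = kept + recycled  # raw high-quality text plus rewritten text
    print(len(kept), len(recycled), len(training_mix))
```

In a real pipeline the rewrite step would presumably call a language model, and the rewritten output might itself need quality filtering before it is mixed in; the sketch leaves both out.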

Findings

01 Recycling discarded web data improves model performance by up to 2.5 percentage points.

02 Training on a mix of raw and synthetic data outperforms training on twice as much raw web data.

03 Approximately 82% of the mixed texts come from rewriting low-quality documents that would otherwise have been discarded.

Abstract

Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has relied on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art results. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they become useful for training. This in turn allows us to increase the representation of synthetic data in the…

Taxonomy

Topics: Web Data Mining and Analysis · Topic Modeling · Data Quality and Management