Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li

TL;DR
This paper introduces REWIRE, a method to transform and recycle discarded web data into high-quality training material, significantly improving language model performance and data efficiency at multiple scales.
Contribution
REWIRE is a novel data augmentation technique that enriches low-quality web texts, enabling more effective pre-training data utilization and outperforming existing synthetic data methods.
Findings
Recycling discarded web data improves model performance by up to 2.5 percentage points.
Training on a mix of raw and synthetic data outperforms training on twice as much web data.
Approximately 82% of the synthetic texts in the mix come from transforming low-quality documents.
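The recycling idea behind these findings can be illustrated with a minimal sketch. It is not the authors' implementation: the function `recycle_and_mix`, the `threshold` parameter, and the toy `toy_score` and `toy_rewrite` helpers are all hypothetical names chosen for illustration. In a real pipeline the quality scorer and the rewriter would presumably be learned models; here they are trivial stand-ins so the example runs on its own.

```python
from typing import Callable, List, Tuple


def recycle_and_mix(
    docs: List[str],
    quality_score: Callable[[str], float],
    rewrite: Callable[[str], str],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split documents by a quality threshold and rewrite the discarded ones.

    Returns (raw_kept, recycled): documents that passed the filter unchanged,
    and rewritten versions of the documents that the filter would have dropped.
    """
    raw_kept: List[str] = []
    recycled: List[str] = []
    for doc in docs:
        if quality_score(doc) >= threshold:
            raw_kept.append(doc)           # high-quality raw text, used as-is
        else:
            recycled.append(rewrite(doc))  # low-quality text, transformed
    return raw_kept, recycled


# Toy stand-ins (assumptions for illustration only).
def toy_score(doc: str) -> float:
    # Longer documents score higher, capped at 1.0.
    return min(len(doc.split()) / 5, 1.0)


def toy_rewrite(doc: str) -> str:
    # Normalize whitespace and capitalize the first letter.
    return " ".join(doc.split()).capitalize()


corpus = ["the cat sat on the mat today", "buy   now!!"]
raw_kept, recycled = recycle_and_mix(corpus, toy_score, toy_rewrite, threshold=0.5)
training_mix = raw_kept + recycled
print(training_mix)
```

The training mix is simply the kept raw documents plus the rewritten ones; how the two parts are weighted against each other is left open in this sketch.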
Abstract
Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the…
Taxonomy
Topics: Web Data Mining and Analysis · Topic Modeling · Data Quality and Management
