Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly

TL;DR
This paper introduces WRAP, a method that uses instruction-tuned models to rephrase web data, significantly improving training efficiency and model performance by enhancing data quality and diversity.
Contribution
The paper proposes WRAP, a novel data augmentation technique using rephrased web documents to accelerate pre-training and boost language model performance.
Findings
Speeds up pre-training by approximately 3x on noisy datasets.
Improves perplexity by over 10% at the same compute budget.
Enhances zero-shot question answering accuracy across multiple tasks.
Abstract
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by approximately 3x. At the same pre-training compute…
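The data-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording in `STYLE_PROMPTS`, the helper names `build_prompt`, `wrap_corpus`, and `echo_rephraser`, and the one-rephrase-per-document mixing are all hypothetical. A real pipeline would replace `echo_rephraser` with a call to an instruction-tuned model.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical prompt templates; the exact prompts used in the paper
# are not given in this summary.
STYLE_PROMPTS: Dict[str, str] = {
    "wikipedia": "Paraphrase the following text in a style like Wikipedia:\n\n{doc}",
    "qa": "Rewrite the following text in question-answer format:\n\n{doc}",
}


def build_prompt(doc: str, style: str) -> str:
    """Fill the chosen style template with the raw web document."""
    return STYLE_PROMPTS[style].format(doc=doc)


def wrap_corpus(
    docs: Sequence[str],
    rephrase_fn: Callable[[str], str],
    styles: Sequence[str] = ("wikipedia", "qa"),
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Pair each real document with one synthetic rephrase and shuffle.

    Assumption: one rephrase per document (a 1:1 real-to-synthetic mix);
    the actual mixing ratio is not stated in this summary.
    """
    rng = random.Random(seed)
    mixed: List[Dict[str, str]] = []
    for doc in docs:
        style = rng.choice(list(styles))
        synthetic = rephrase_fn(build_prompt(doc, style))
        mixed.append({"text": doc, "source": "real"})
        mixed.append({"text": synthetic, "source": f"synthetic:{style}"})
    rng.shuffle(mixed)  # interleave real and synthetic examples
    return mixed


def echo_rephraser(prompt: str) -> str:
    """Stand-in for an instruction-tuned model: returns the document unchanged."""
    return "[rephrased] " + prompt.split("\n\n", 1)[1]


corpus = wrap_corpus(["cheap flights!!! click here", "the mitochondria is..."], echo_rephraser)
```

Keeping the real documents in the mix alongside the rephrases mirrors the abstract's description of jointly pre-training on real and synthetic data. The `source` field is only there so the two kinds of example can be told apart downstream.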
Taxonomy
Topics: Natural Language Processing Techniques
