Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Pratyush Maini; Skyler Seto; He Bai; David Grangier; Yizhe Zhang; Navdeep Jaitly

arXiv:2401.16380·cs.CL·January 30, 2024·1 cites

Open Access

TL;DR

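This paper introduces WRAP, a method that prompts an off-the-shelf instruction-tuned model to rephrase web documents in specific styles (for example, Wikipedia-like prose or question-answer format) and pre-trains language models on the real and rephrased text together, improving training efficiency and model performance.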

Contribution

The paper proposes WRAP, a data augmentation technique that pre-trains language models jointly on real web documents and their synthetic rephrases to accelerate pre-training and improve performance.

Findings

01. Speeds up pre-training by approximately 3x on noisy web data such as C4.

02. Improves perplexity by over 10% at the same compute budget.

03. Improves zero-shot question answering accuracy across multiple tasks.

Abstract

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by $\sim 3\times$. At the same pre-training compute…
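The data-construction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `rephrase` callable (a stand-in for whatever instruction-tuned model is used), the function names, and the choice of one rephrase per document per style are all assumptions made for illustration.

```python
import random
from typing import Callable, Iterable

# Hypothetical prompt templates. The wording is illustrative only and is
# NOT taken from the paper; only the two style names come from the abstract.
STYLE_PROMPTS = {
    "wikipedia": (
        "Paraphrase the following text in clear, well-formed English, "
        "like a Wikipedia article:\n\n{doc}"
    ),
    "qa": "Rewrite the following text in question-answer format:\n\n{doc}",
}


def build_prompt(doc: str, style: str) -> str:
    """Fill the template for the requested style with one web document."""
    return STYLE_PROMPTS[style].format(doc=doc)


def build_wrap_corpus(
    docs: Iterable[str],
    rephrase: Callable[[str], str],
    styles: tuple[str, ...] = ("wikipedia", "qa"),
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Return a shuffled list of (source, text) pairs mixing real and rephrased docs.

    `rephrase` is an assumed interface: it takes a prompt string and returns
    the model's rephrased text. The real/synthetic ratio that results here
    (one rephrase per style per document) is an assumption, not a
    recommendation from the paper.
    """
    docs = list(docs)
    # Keep every original document so the model still sees real web text.
    corpus = [("real", doc) for doc in docs]
    # Add one synthetic rephrase per document for each requested style.
    for doc in docs:
        for style in styles:
            corpus.append((style, rephrase(build_prompt(doc, style))))
    # Shuffle so real and synthetic examples are interleaved during training.
    random.Random(seed).shuffle(corpus)
    return corpus
```

In practice `rephrase` would wrap a call to an instruction-tuned model; for a dry run, any function from string to string (for example `lambda prompt: prompt.upper()`) exercises the mixing logic without a model.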

Taxonomy

Topics: Natural Language Processing Techniques