The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

TL;DR
This paper introduces the RefinedWeb dataset, demonstrating that carefully filtered web data alone can train large language models that outperform those trained on curated corpora, with abundant high-quality data available from web sources.
Contribution
The authors show that web data, when properly filtered and deduplicated, can replace curated datasets for training high-performing language models, scaling to trillions of tokens.
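The two steps named above, filtering and deduplication, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the pipeline described in the paper: the function names, the word-count and symbol-ratio thresholds, and the use of exact hashing over normalized text are all hypothetical choices made for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_quality_filter(text: str, min_words: int = 5,
                          max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: require enough words and few non-alphanumeric characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def filter_and_deduplicate(documents):
    """Keep documents that pass the filter, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for doc in documents:
        if not passes_quality_filter(doc):
            continue
        # Exact-match deduplication: a simplified stand-in for large-scale methods.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Calling `filter_and_deduplicate` on a list of strings returns the surviving documents in their original order. Hashing normalized text only catches duplicates that differ in case or whitespace; near-duplicates would need a fuzzy method.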
Findings
Models trained on RefinedWeb significantly outperform state-of-the-art models trained on The Pile.
Five trillion tokens of high-quality web data are obtainable from CommonCrawl.
The public release includes a 600-billion-token extract of RefinedWeb and models with 1.3 and 7.5 billion parameters.
Abstract
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of…
Code & Models
- 🤗 tiiuae/falcon-40b · model · 22k downloads · ♡ 2433
- 🤗 tiiuae/falcon-7b · model · 153k downloads · ♡ 1099
- 🤗 tiiuae/falcon-7b-instruct · model · 58k downloads · ♡ 1031
- 🤗 tiiuae/falcon-rw-1b · model · 12k downloads · ♡ 118
- 🤗 tiiuae/falcon-rw-7b · model · 335 downloads · ♡ 17
- 🤗 tiiuae/falcon-40b-instruct · model · 41k downloads · ♡ 1177
- 🤗 michaelfeil/ct2fast-falcon-7b-instruct · model · 6 downloads · ♡ 1
- 🤗 michaelfeil/ct2fast-falcon-40b · model · 2 downloads · ♡ 1
- 🤗 michaelfeil/ct2fast-falcon-40b-instruct · model · 8 downloads · ♡ 2
- 🤗 michaelfeil/ct2fast-falcon-7b · model · 4 downloads · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
