FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf

TL;DR
FineWeb2 introduces an adaptable pre-training data pipeline that efficiently creates high-quality multilingual datasets, significantly improving model performance across diverse languages and scaling to over 1000 languages.
Contribution
The paper presents a novel, adaptable data curation pipeline for multilingual pre-training, along with a large-scale dataset and evaluation methods for diverse languages.
Findings
Pipeline improves multilingual model performance
Dataset scaling to over 1000 languages
Enhanced data balancing and deduplication methods
Abstract
Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more…
Code & Models
- 🤗 yanolja/YanoljaNEXT-Rosetta-12B-2510 · model · 202 dl · ♡ 30
- 🤗 Bedovyy/YanoljaNEXT-Rosetta-12B-2510-FP8-Dynamic · model
- 🤗 yanolja/YanoljaNEXT-Rosetta-4B-2510 · model · 16 dl · ♡ 10
- 🤗 yanolja/YanoljaNEXT-Rosetta-4B-2510-GGUF · model · 120 dl
- 🤗 yanolja/YanoljaNEXT-Rosetta-12B-2510-GGUF · model · 233 dl
- 🤗 yanolja/YanoljaNEXT-Rosetta-4B-2511 · model · 221 dl · ♡ 10
- 🤗 yanolja/YanoljaNEXT-Rosetta-4B-2511-FP8 · model · 10 dl
- 🤗 yanolja/YanoljaNEXT-Rosetta-4B-2511-GGUF · model · 566 dl · ♡ 2
- 🤗 yanolja/YanoljaNEXT-Rosetta-27B-2511 · model · 24 dl · ♡ 35
- 🤗 yanolja/YanoljaNEXT-Rosetta-27B-2511-FP8 · model · 522 dl · ♡ 2
Taxonomy
Topics: Educational Technology and Assessment · Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning
Methods: Sparse Evolutionary Training
