FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo; Hynek Kydlíček; Vinko Sabolčec; Bettina Messmer; Negar Foroutan; Amir Hossein Kargaran; Colin Raffel; Martin Jaggi; Leandro Von Werra; Thomas Wolf

arXiv:2506.20920 · cs.CL · June 27, 2025



TL;DR

FineWeb2 introduces an adaptable pre-training data pipeline that efficiently creates high-quality multilingual datasets, significantly improving model performance across diverse languages and scaling to over 1000 languages.

Contribution

The paper presents a novel, adaptable data curation pipeline for multilingual pre-training, along with a large-scale dataset and evaluation methods for diverse languages.

Findings

01 Pipeline improves multilingual model performance

02 Dataset scaling to over 1000 languages

03 Enhanced data balancing and deduplication methods

Abstract

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more…
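To make the idea of a filtering and deduplication pipeline that adapts to each language more concrete, here is a minimal sketch in Python. It is not the FineWeb2 implementation: the `LanguageConfig` class, the `curate` and `passes_filters` functions, the language codes, and every threshold value are hypothetical and chosen only for illustration. The sketch assumes each document already carries a language label, applies per-language filter thresholds, and drops exact duplicates within a language by hashing normalized text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical per-language filter settings (illustrative values only)."""
    min_words: int
    max_repeated_line_ratio: float


# Assumed example thresholds; these are NOT values from the paper.
CONFIGS = {
    "fra": LanguageConfig(min_words=5, max_repeated_line_ratio=0.3),
    "swh": LanguageConfig(min_words=3, max_repeated_line_ratio=0.5),
}


def passes_filters(text: str, cfg: LanguageConfig) -> bool:
    """Apply two simple heuristic filters using the language's thresholds."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    # Whitespace splitting is a simplification; it does not suit
    # languages that are written without spaces between words.
    if len(text.split()) < cfg.min_words:
        return False
    repeated_ratio = 1 - len(set(lines)) / len(lines)
    return repeated_ratio <= cfg.max_repeated_line_ratio


def curate(docs, configs):
    """Filter (lang, text) pairs, then drop exact duplicates per language."""
    seen = set()
    kept = []
    for lang, text in docs:
        cfg = configs.get(lang)
        if cfg is None or not passes_filters(text, cfg):
            continue
        # Normalize case and whitespace before hashing.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if (lang, digest) in seen:
            continue
        seen.add((lang, digest))
        kept.append((lang, text))
    return kept
```

Calling `curate(docs, CONFIGS)` on a list of `(language_code, text)` pairs returns the pairs that survive both steps, in input order. Keeping the thresholds in a per-language table rather than in the filter code is the design choice that lets one pipeline serve many languages; how such values would be derived automatically is left open here.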


Code & Models

Repositories

huggingface/fineweb-2 (Official)


Taxonomy

Topics: Educational Technology and Assessment · Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning

Methods: Sparse Evolutionary Training