HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Stephan Oepen; Nikolay Arefev; Mikko Aulamo; Marta Bañón; Maja Buljan; Laurie Burchell; Lucas Charpentier; Pinzhen Chen; Mariya Fedorova; Ona de Gibert; Barry Haddow; Jan Hajič; Jindřich Helcl; Andrey Kutuzov; Veronika Laippala; Zihao Li; Risto Luukkonen; Bhavitvya Malik; Vladislav Mikhailov; Amanda Myntti; Dayyán O'Brien; Lucie Poláková; Sampo Pyysalo; Gema Ramírez Sánchez; Janine Siewert; Pavel Stepachev; Jörg Tiedemann; Teemu Vahtola; Dušan Variš; Fedor Vitiugin; Tea Vojtěchová; Jaume Zaragoza

arXiv:2511.01066·cs.CL·April 21, 2026

50 Models 6 Datasets

TL;DR

This paper introduces HPLT 3.0, a collection of open multilingual resources covering almost 200 languages: roughly 30 trillion tokens of LLM pre-training data, parallel texts for machine translation, multilingual evaluation benchmarks, and pre-trained models.

Contribution

It provides what is likely the largest generally available multilingual collection of LLM pre-training data, together with an open-source processing pipeline, multilingual evaluation benchmarks, and a suite of monolingual and multilingual models.

Findings

01

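Data quality is probed through contrastive and analytical statistics, manual inspection of samples for 24 languages, and end-to-end evaluation of language models trained on the data.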

02

Models trained on this data achieve competitive performance across languages.

03

The dataset includes extensive parallel texts and synthesized corpora for translation tasks.

Abstract

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data.…
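
The abstract names exact and near-deduplication as one pipeline stage but does not say how it is implemented. The following is only a minimal sketch of the general idea, not the HPLT pipeline: exact duplicates are caught by hashing normalized text, and near-duplicates by Jaccard similarity over word trigrams. The function names, the normalization, the trigram size, and the 0.8 threshold are all assumptions chosen for illustration. The pairwise comparison is quadratic in the number of documents; web-scale pipelines typically rely on approximate methods such as MinHash instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Short texts yield a single shingle holding all of their words.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold: float = 0.8):
    """Keep the first occurrence of each document.

    A document is dropped if its normalized text hashes to a value already
    seen (exact duplicate) or if its shingle set reaches the Jaccard
    threshold against any kept document (near-duplicate).
    The threshold is an illustrative assumption, not a value from the paper.
    """
    seen_hashes = set()
    kept = []  # list of (original document, shingle set)
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        shingle_set = shingles(doc)
        if any(jaccard(shingle_set, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document that was kept
        seen_hashes.add(digest)
        kept.append((doc, shingle_set))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",        # exact after normalization
        "The quick brown fox jumps over the lazy dog today",   # near-duplicate
        "An entirely different sentence about multilingual data",
    ]
    for doc in deduplicate(sample):
        print(doc)
```

Keeping the first occurrence is a simplification; a production pipeline would also need a policy for which copy of a duplicate cluster to retain.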
