HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen

TL;DR
This paper introduces HPLT 3.0, a collection of open multilingual datasets totalling about 30 trillion tokens across nearly 200 languages, together with evaluation benchmarks, pre-trained monolingual models, and parallel texts.
Contribution
It provides what is likely the largest generally available multilingual collection of LLM pre-training data, comprehensive multilingual benchmarks, and a suite of monolingual and multilingual models.
Findings
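Data quality is probed through contrastive and analytical statistics, manual inspection of samples for 24 languages, and end-to-end evaluation of language models trained on the data.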
Models trained on this data achieve competitive performance across languages.
The dataset includes extensive parallel texts and synthesized corpora for translation tasks.
Abstract
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data.…
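The abstract names exact and near-deduplication as one stage of the pipeline but does not say how it is implemented. The following is a minimal sketch of the general idea only, not the HPLT pipeline's method: exact duplicates are caught by hashing normalized text, and near duplicates by Jaccard similarity over word trigrams. The function names, the trigram size, and the 0.8 threshold are all illustrative assumptions.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Keep the first copy of each document; drop exact and near duplicates.

    The threshold is a hypothetical value chosen for illustration.
    """
    seen_hashes = set()
    kept = []  # (original text, shingle set) for every document kept so far
    for doc in docs:
        # Exact deduplication: identical normalized text gives an identical digest.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Near deduplication: compare against every document already kept.
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue
        kept.append((doc, doc_shingles))
    return [text for text, _ in kept]
```

This sketch compares every document against all documents kept so far, which is quadratic in the number of documents. At web scale, approximate techniques such as MinHash with locality-sensitive hashing are commonly used instead.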
Code & Models
- 🤗 HPLT/hplt-pre3-global-spa_Latn-llama-2b-30bt (model, 4 downloads)
- 🤗 HPLT/hplt-pre3-global-fra_Latn-llama-2b-30bt (model, 5 downloads)
- 🤗 HPLT/hplt-pre3-global-fin_Latn-llama-2b-30bt (model, 3 downloads)
- 🤗 HPLT/hplt-3.0-ukr_Cyrl-llama-2b-100bt (model, 1 download)
- 🤗 HPLT/hplt-3.0-fra_Latn-llama-2b-100bt (model, 19 downloads)
- 🤗 HPLT/hplt-2.0-eus_Latn-llama-2b-30bt (model, 2 downloads)
- 🤗 HPLT/hplt-3.0-nor_Latn-llama-2b-100bt (model, 1.5k downloads)
- 🤗 HPLT/hplt-2.0-fin_Latn-llama-2b-30bt (model, 3 downloads)
- 🤗 HPLT/hplt-3.0-eus_Latn-llama-2b-100bt (model, 1 download)
- 🤗 HPLT/hplt-3.0-glg_Latn-llama-2b-100bt (model, 1 download)
