A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

TL;DR
This paper introduces HPLT, a massive multilingual dataset from web crawls, including monolingual and bilingual corpora for 75 languages, designed to advance language modeling and translation.
Contribution
The paper presents a large-scale, open multilingual dataset focused on low- to medium-resourced languages, together with the data acquisition, management and processing methods used to build it, which rely on open-source software tools and high-performance computing.
Findings
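Monolingual collection contains ~5.6 trillion word tokens across 75 languages, de-duplicated at the document level
English-centric parallel corpus includes more than 96 million aligned sentence pairs for 18 language pairs, with roughly 1.4 billion English tokens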
Provides one of the largest open multilingual corpora available
Abstract
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource…
Code & Models
- 🤗 BSC-LT/salamandra-7b-instruct · model · 81k dl · ♡ 77
- 🤗 BSC-LT/salamandraTA-7b-instruct · model · 1.6k dl · ♡ 25
- 🤗 BSC-LT/salamandra-7b · model · 355 dl · ♡ 29
- 🤗 BSC-LT/salamandra-2b · model · 1.3k dl · ♡ 25
- 🤗 BSC-LT/salamandra-2b-instruct · model · 6.3k dl · ♡ 27
- 🤗 robbiemu/salamandra-2b-instruct · model · 92 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-instruct-gguf · model · 141 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-gguf · model · 73 dl
- 🤗 robbiemu/salamandra-2b · model · 111 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf · model · 356 dl
Taxonomy
Topics: Text and Document Classification Technologies
