A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert; Graeme Nail; Nikolay Arefyev; Marta Bañón; Jelmer van der Linde; Shaoxiong Ji; Jaume Zaragoza-Bernabeu; Mikko Aulamo; Gema Ramírez-Sánchez; Andrey Kutuzov; Sampo Pyysalo; Stephan Oepen; Jörg Tiedemann

arXiv:2403.14009·cs.CL·March 22, 2024·5 cites

TL;DR

This paper introduces the HPLT language resources, a massive multilingual dataset built from web crawls. It comprises monolingual corpora for 75 languages and English-centric parallel corpora for 18 language pairs, designed to advance language modeling and translation.

Contribution

The paper presents a large-scale, open multilingual dataset focused on low- to medium-resourced languages, and describes the data acquisition, management and processing methods used to build it, which rely on open-source software tools and high-performance computing.

Findings

01 Contains ~5.6 trillion word tokens across 75 languages, de-duplicated on the document level

02 Includes more than 96 million aligned sentence pairs across 18 English-centric language pairs

03 Provides one of the largest open text corpora ever released

Abstract

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource…
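The abstract reports token counts "de-duplicated on the document level". As a rough illustration of what that can mean, the following is a minimal sketch of exact document-level de-duplication by content hashing, followed by a whitespace token count. It is not the HPLT pipeline: the function names `dedup_documents` and `count_word_tokens`, the normalization (collapsing whitespace and lowercasing), and the whitespace tokenizer are all assumptions made for this example.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized text is seen."""
    seen: set[bytes] = set()
    for doc in docs:
        # Assumed normalization: collapse whitespace runs and lowercase.
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield doc


def count_word_tokens(docs: Iterable[str]) -> int:
    """Count whitespace-separated tokens (a crude stand-in for a real tokenizer)."""
    return sum(len(doc.split()) for doc in docs)


if __name__ == "__main__":
    sample = ["Hello  world", "hello world", "Another document here"]
    unique = list(dedup_documents(sample))
    print(len(unique), "unique documents,", count_word_tokens(unique), "word tokens")
```

Storing only fixed-size digests keeps memory bounded by the number of distinct documents rather than their total length. At the scale described in the abstract, this step would presumably need to be sharded or distributed, which this sketch does not attempt.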

Taxonomy

Topics: Text and Document Classification Technologies