esCorpius: A Massive Spanish Crawling Corpus

Asier Gutiérrez-Fandiño; David Pérez-Fernández; Jordi Armengol-Estapé; David Griol; Zoraida Callejas

arXiv:2206.15147·cs.CL·July 4, 2022·1 cites

TL;DR

esCorpius is the largest high-quality Spanish web crawling corpus derived from nearly 1 petabyte of data, enabling better language models for Spanish NLP tasks.

Contribution

The paper introduces esCorpius, a novel, large-scale, high-quality Spanish corpus with advanced cleaning and deduplication, filling a significant gap in Spanish NLP resources.

Findings

01. Largest Spanish web corpus with high quality
02. Includes source URLs for regulatory compliance
03. Facilitates improved Spanish language models

Abstract

In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data for (pre-)training, and corpora in languages other than English are scarce. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the Spanish portions of these datasets have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning…
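
To make the idea of deduplicating web text concrete, here is a minimal sketch of exact paragraph-level deduplication by hashing normalized text. It is an illustration only and not the authors' pipeline, which the truncated abstract does not describe; the function names `normalize` and `deduplicate`, the NFKC/lowercase/whitespace normalization, and the choice of SHA-1 are all assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(paragraph: str) -> str:
    """Normalize Unicode, lowercase, and collapse whitespace before hashing.

    Assumption: this normalization is illustrative, not taken from the paper.
    """
    text = unicodedata.normalize("NFKC", paragraph)
    return " ".join(text.lower().split())


def deduplicate(paragraphs):
    """Return paragraphs in their original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = normalize(paragraph)
        if not key:
            continue  # skip empty or whitespace-only paragraphs
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(paragraph)
    return unique


if __name__ == "__main__":
    sample = ["Hola mundo.", "hola   mundo.", "", "Adiós."]
    print(deduplicate(sample))
```

Storing fixed-size digests rather than the paragraphs themselves keeps the memory cost per seen item constant, which is one common reason to hash when deduplicating at web scale; near-duplicate detection would need a different technique.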

Taxonomy

Topics: Data Quality and Management · Web Data Mining and Analysis · Advanced Data Storage Technologies