esCorpius: A Massive Spanish Crawling Corpus
Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

TL;DR
esCorpius is the largest high-quality Spanish web crawling corpus derived from nearly 1 petabyte of data, enabling better language models for Spanish NLP tasks.
Contribution
The paper introduces esCorpius, a novel, large-scale, high-quality Spanish corpus with advanced cleaning and deduplication, filling a significant gap in Spanish NLP resources.
Findings
Largest Spanish web corpus with high quality
Includes source URLs for regulatory compliance
Facilitates improved Spanish language models
Abstract
In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages or of low quality because of sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning…
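The abstract is cut off before it describes the curation pipeline, so the following is only a minimal sketch of what paragraph-level deduplication of web text can look like in general, not the authors' method. The function names `normalize` and `deduplicate`, the normalization rules, and the choice of SHA-1 over exact normalized text are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(paragraph: str) -> str:
    """Normalize Unicode, lowercase, and collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFC", paragraph)
    return " ".join(text.lower().split())


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, compared by content hash.

    Hypothetical helper: exact-match deduplication only, with no
    near-duplicate detection.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = normalize(paragraph)
        if not key:
            continue  # skip empty or whitespace-only paragraphs
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent paragraph was already kept
        seen.add(digest)
        unique.append(paragraph)
    return unique
```

Storing fixed-size digests instead of full paragraphs keeps the memory footprint of the `seen` set predictable, which matters when the input is a crawl rather than a handful of strings.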
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Advanced Data Storage Technologies
