Spanish Pre-trained BERT Model and Evaluation Data
José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez

TL;DR
This paper introduces a BERT model pre-trained exclusively on Spanish data, together with a compilation of Spanish evaluation tasks in the spirit of GLUE. The model outperforms multilingual BERT-based models on most of these tasks, and all resources are publicly released.
Contribution
The paper presents a BERT-based language model pre-trained exclusively on Spanish data, and a single repository that compiles several Spanish-language tasks, much in the spirit of the GLUE benchmark.
Findings
The Spanish BERT model outperforms multilingual models on most tasks.
The fine-tuned model achieves a new state of the art on some of the Spanish NLP tasks.
The model, the pre-training data, and the compilation of Spanish benchmarks are publicly released.
Abstract
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
