RoBERTuito: a pre-trained language model for social media text in Spanish
Juan Manuel Pérez, Damián A. Furman, Laura Alonso Alemany, Franco Luque

TL;DR
RoBERTuito is a Spanish language model trained on 500 million tweets, outperforming existing models on user-generated text tasks and showing competitive results in code-switching benchmarks.
Contribution
This work introduces RoBERTuito, a domain-specific pre-trained language model for Spanish social media text, and demonstrates its superior performance on relevant NLP tasks.
Findings
Outperforms other Spanish language models on user-generated text tasks
Achieves top results on some English-Spanish code-switching benchmarks
Competitive with monolingual models on English tasks
Abstract
Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for Natural Language Understanding tasks. Recently, some works have been geared towards pre-training specially crafted models for particular domains, such as scientific papers, medical documents, and user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks. However, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model achieves top results for some English-Spanish tasks of the Linguistic…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Weight Decay · Attention Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing
