Regionalized models for Spanish language variations based on Twitter
Eric S. Tellez, Daniela Moctezuma, Sabino Miranda, Mario Graff, Guillermo Ruiz

TL;DR
This paper develops regionalized Spanish language models from Twitter data geotagged in 26 countries, capturing local language variations to improve performance on regional NLP tasks.
Contribution
Introduces regionalized Spanish language resources including embeddings, BERT models, and corpora, with comprehensive regional comparison and application examples.
Findings
Regional language models improve regional NLP task performance.
Lexical and semantic differences vary significantly across regions.
Regional resources enable more accurate message classification.
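One simple way to make the idea of lexical difference between regions concrete is to compare regional vocabularies with a set-overlap score. The sketch below uses Jaccard similarity on tiny hand-picked word sets. The word lists, the `jaccard` and `pairwise_overlap` helpers, and the choice of Jaccard itself are illustrative assumptions; they are not taken from the paper's corpora and are not presented as the authors' comparison method.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty vocabularies are treated as identical
    return len(a & b) / len(a | b)


def pairwise_overlap(vocabularies):
    """Return {(region_1, region_2): similarity} for every pair of regions."""
    return {
        (r1, r2): jaccard(vocabularies[r1], vocabularies[r2])
        for r1, r2 in combinations(sorted(vocabularies), 2)
    }


# Hypothetical toy vocabularies, hand-picked for illustration only.
toy_vocab = {
    "MX": {"carro", "platicar", "chamba", "computadora"},
    "ES": {"coche", "hablar", "curro", "ordenador"},
    "AR": {"auto", "charlar", "laburo", "computadora"},
}

if __name__ == "__main__":
    for pair, score in pairwise_overlap(toy_vocab).items():
        print(pair, round(score, 3))
```

With real data, the sets would typically be the most frequent tokens of each regional corpus; a lower score then suggests a larger lexical gap between the two regions.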
Abstract
Spanish is one of the most widely spoken languages in the world, but it is not written and spoken the same way in every country. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by better interpreting a message's content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event with echoes on social media; both can use dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings…
Code & Models
- 🤗 guillermoruiz/bilma_MX (model) · 10 downloads
- 🤗 guillermoruiz/bilma_AR (model) · 11 downloads
- 🤗 guillermoruiz/bilma_CL (model) · 6 downloads
- 🤗 guillermoruiz/bilma_CO (model) · 8 downloads
- 🤗 guillermoruiz/bilma_ES (model) · 5 downloads
- 🤗 guillermoruiz/bilma_US (model) · 5 downloads
- 🤗 guillermoruiz/bilma_UY (model) · 3 downloads
- 🤗 guillermoruiz/bilma_VE (model) · 5 downloads
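The repositories above follow a common naming pattern: an owner, a `bilma_` prefix, and a two-letter country code. A minimal sketch of selecting one by country follows. The helper names `repo_id` and `download_snapshot` are hypothetical, and the sketch assumes the repository names are spelled exactly as listed here. `snapshot_download` from the `huggingface_hub` package only fetches the repository files; how the model is then loaded is not covered by this page and is left out.

```python
# Country codes taken from the model list above.
COUNTRY_CODES = ("MX", "AR", "CL", "CO", "ES", "US", "UY", "VE")


def repo_id(country_code, owner="guillermoruiz", prefix="bilma_"):
    """Build the Hub repository id for a regional model (hypothetical helper)."""
    code = country_code.upper()
    if code not in COUNTRY_CODES:
        raise ValueError(f"no regional model listed for country code {code!r}")
    return f"{owner}/{prefix}{code}"


def download_snapshot(country_code):
    """Download the repository files; needs network access and huggingface_hub."""
    # Imported lazily so that repo_id() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(country_code))


if __name__ == "__main__":
    for code in COUNTRY_CODES:
        print(code, "->", repo_id(code))
```

Calling `download_snapshot("MX")` would return the local directory holding the downloaded files, provided the repository exists under that exact name.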
Taxonomy
Topics: Spanish Linguistics and Language Studies · Linguistic Variation and Morphology · Linguistics, Language Diversity, and Identity
Methods: Attention Is All You Need · Linear Layer · Adam · Multi-Head Attention · Layer Normalization · Residual Connection · Dense Connections · Attention Dropout · Softmax · WordPiece
