TL;DR
The #Somos600M Project aims to develop open-source NLP resources to represent the linguistic diversity of Spanish-speaking regions, addressing the lack of datasets and benchmarks for instruction-tuning and evaluating large language models in these languages.
Contribution
We created the first open instruction and evaluation datasets for Spanish dialects from LATAM, the Caribbean, and Spain, supporting NLP development for these diverse languages.
Findings
First open instruction dataset for Spanish dialects
Initial evaluation benchmarks for Spanish NLP models
Community-driven dataset creation approach
Abstract
We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Although we make up 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we, as an international open-source community, have created the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.