The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

María Grandury

arXiv:2407.17479·cs.CL·July 26, 2024

TL;DR

The #Somos600M Project aims to develop open-source NLP resources to represent the linguistic diversity of Spanish-speaking regions, addressing the lack of datasets and benchmarks for instruction-tuning and evaluating large language models in these languages.

Contribution

We created the first open instruction and evaluation datasets for Spanish dialects from LATAM, the Caribbean, and Spain, supporting NLP development for these diverse languages.

Findings

01 First open instruction dataset for Spanish dialects

02 Initial evaluation benchmarks for Spanish NLP models

03 Community-driven dataset creation approach

Abstract

We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Although we make up 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we, as an international open-source community, have created the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.
