Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
Yoshifumi Kawasaki

TL;DR
Using a large-scale, expert-curated lexical database, this paper investigates how well large language models capture regional lexical differences in Spanish and finds systematic biases and uneven accuracy across dialects.
Contribution
It provides a comprehensive, large-scale evaluation of dialectal lexical knowledge in LLMs for Spanish, highlighting biases that data volume alone does not explain.
Findings
Models recognize lexical variation from Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River more accurately.
The Chilean variety is particularly challenging for the models.
These differences are not solely attributable to data quantity.
Abstract
This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves…
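The Yes-No probing format described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the `Item` record, the `ask` callback standing in for an LLM call, and the four toy items are hypothetical and are not taken from the paper or its database.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Item(NamedTuple):
    """One hypothetical survey item (not the paper's actual schema)."""
    concept: str    # English gloss of the concept
    variant: str    # Spanish word being tested
    country: str    # country the question asks about
    attested: bool  # gold label: is the variant used there?


def yes_no_prompt(item: Item) -> str:
    # Assumed prompt wording; the paper's real prompts may differ.
    return (
        f"Is the word '{item.variant}' commonly used in {item.country} "
        f"to refer to '{item.concept}'? Answer Yes or No."
    )


def accuracy_by_country(
    items: List[Item], ask: Callable[[str], str]
) -> Dict[str, float]:
    """Ask one Yes-No question per item and score accuracy per country."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        reply = ask(yes_no_prompt(item)).strip().lower()
        predicted = reply.startswith("yes")
        total[item.country] += 1
        correct[item.country] += int(predicted == item.attested)
    return {country: correct[country] / total[country] for country in total}


# Toy items for illustration only.
TOY_ITEMS = [
    Item("computer", "ordenador", "Spain", True),
    Item("computer", "computadora", "Spain", False),
    Item("computer", "computadora", "Mexico", True),
    Item("computer", "ordenador", "Mexico", False),
]


def always_yes(prompt: str) -> str:
    """Stand-in for a model call; a real setup would query an LLM here."""
    return "Yes"


scores = accuracy_by_country(TOY_ITEMS, always_yes)
print(scores)
```

A trivial responder such as `always_yes` scores 0.5 on this balanced toy set, which is a useful sanity baseline before swapping in a real model call. A multiple-choice variant would follow the same pattern, with the prompt listing candidate words and the score comparing the chosen option against the attested one.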
Taxonomy
Topics: Linguistic Variation and Morphology · Language and cultural evolution · Linguistics, Language Diversity, and Identity
