Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

TL;DR
This paper assesses how well current language models understand Basque and Spanish dialects using NLI, revealing significant performance drops especially for Basque variants, and provides a new dataset for future research.
Contribution
Introduces a novel parallel dataset for Basque and Spanish variants and analyzes the impact of linguistic variation on LLM performance in NLI tasks.
Findings
Performance drops in LLMs when handling linguistic variation
Encoder-only models struggle more with Western Basque
Variation impacts model understanding beyond lexical overlap
Abstract
In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are…
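A minimal sketch of the two quantities the abstract refers to: per-variety NLI accuracy (to measure the drop between a standard and a variant) and premise-hypothesis lexical overlap (the factor the error analysis rules out). This is not the authors' code. The function names, the record layout, the Jaccard overlap measure, and the toy records are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_variety(records):
    """Compute accuracy per language variety.

    `records` is an iterable of (variety, gold_label, predicted_label)
    tuples. This layout is hypothetical, not the paper's data format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, gold, pred in records:
        total[variety] += 1
        correct[variety] += int(gold == pred)
    return {v: correct[v] / total[v] for v in total}


def jaccard_overlap(premise, hypothesis):
    """Token-level Jaccard overlap between premise and hypothesis.

    Whitespace tokenization is an assumption; the paper may measure
    lexical overlap differently.
    """
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if not p and not h:
        return 0.0
    return len(p & h) / len(p | h)


# Toy, made-up predictions purely to exercise the functions.
records = [
    ("standard", "entailment", "entailment"),
    ("standard", "contradiction", "contradiction"),
    ("variant", "entailment", "neutral"),
    ("variant", "contradiction", "contradiction"),
]

scores = accuracy_by_variety(records)
drop = scores["standard"] - scores["variant"]
overlap = jaccard_overlap("the cat sleeps", "the cat runs")
```

If the accuracy drop persists among examples with similar overlap scores, that would be consistent with the abstract's claim that the decline stems from the variation itself and not from lexical overlap.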
Taxonomy
Topics: Natural Language Processing Techniques · Phonetics and Phonology Research · Speech Recognition and Synthesis
