Linguini: A benchmark for language-agnostic linguistic reasoning
Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

TL;DR
Linguini introduces a new linguistic reasoning benchmark for low-resource languages, revealing current models' limited ability to solve language-agnostic puzzles without prior language knowledge.
Contribution
The paper presents a novel benchmark of 894 questions, grouped in 160 problems across 75 (mostly) extremely low-resource languages, to evaluate language models' linguistic reasoning skills without relying on pre-existing language-specific knowledge.
Findings
All models perform below 25% accuracy.
Proprietary models substantially outperform open models: the best proprietary model reaches 24.05% accuracy, versus 8.84% for the best open model.
The benchmark exposes gaps in language-agnostic reasoning capabilities.
Abstract
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
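As an illustration of how accuracy over questions grouped into problems could be computed, here is a minimal sketch. It assumes exact-match scoring after lowercasing and whitespace normalization; the abstract does not specify the paper's actual metric, and the names `Question`, `normalize`, `exact_match_accuracy`, and `per_problem_accuracy` are hypothetical.

```python
from typing import NamedTuple


class Question(NamedTuple):
    problem_id: str  # hypothetical identifier of the parent problem
    reference: str   # gold answer
    prediction: str  # model output


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing (an assumption)."""
    return " ".join(text.lower().split())


def exact_match_accuracy(questions: list) -> float:
    """Fraction of questions whose normalized prediction equals the reference."""
    if not questions:
        return 0.0
    correct = sum(
        normalize(q.prediction) == normalize(q.reference) for q in questions
    )
    return correct / len(questions)


def per_problem_accuracy(questions: list) -> dict:
    """Group questions by problem and score each group separately."""
    grouped = {}
    for q in questions:
        grouped.setdefault(q.problem_id, []).append(q)
    return {pid: exact_match_accuracy(qs) for pid, qs in grouped.items()}
```

Scoring per question and per problem separately makes it possible to see whether a model's correct answers are spread across many puzzles or concentrated in a few.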
Peer Reviews
Decision: Submitted to ICLR 2025
Linguistic reasoning is an exciting way to analyze the capabilities of LLMs, combining several important skills (e.g., reasoning, language skills). The creation of the benchmark is well-motivated and sound. The authors evaluated a large number of LLMs, with some interesting findings (e.g., the difference between open and closed models, the results of the character substitution experiment).
The key weakness of the paper is that there already exists a benchmark for testing the linguistic reasoning skills of LLMs, specifically LINGOLY ([Bean et al., 2024](https://arxiv.org/abs/2406.06196)), also comprised of puzzles from the Linguistic Olympiad. It seems that LINGOLY has a wider scope than Linguini, so it is not clear what Linguini adds beyond LINGOLY. The authors also do not discuss the difference between their benchmark and LINGOLY -- in fact, they do not even mention LINGOLY. (I a
1. The experiments were conducted using a diverse set of 13 open and closed LLMs, enhancing the generalizability of the conclusions.
2. The selection of low-resource languages unlikely to appear in pre-training data and the dataset design that focuses on pure linguistic reasoning abilities are interesting and present a novel idea.
3. The benchmark was rigorously tested from various perspectives, providing a high-quality resource for the community:
   - Experiments without context indicate that
1. It is unclear how low-resource the selected 75 languages are for the language models used in the experiments. Although it is likely that they are low-resource, this assumption alone does not ensure reliability in the experimental results. Since, for some of the models, pre-training data is publicly available, verifying the presence of these languages in the pre-training data would strengthen the paper.
2. The motivation for putting the constraint that models must learn and utilize the charact
1. The introduced benchmark is fairly challenging, given that SoTA performance is below 25%.
2. The experiments are rigorous and thorough. For example, the “no context prompting” experiment in 5.1 provided evidence that data in these languages is absent from the models' training data. Another example is 5.3, which shows that unless a language is higher on the resource scale, scores remain low.
3. The paper is well written and makes for an interesting read.
1. The goal of this work is to create a benchmark to evaluate the linguistic skills of the model (unrelated to language-specific learning). It would be good to fully understand why this is an important problem. Could a toy dataset be built instead, something that tests linguistic abilities but isn’t a real language?
2. The paper is well written, but it would be good to improve some areas, such as:
   - The related work could use more detail. For instance, it’d be important to add information about very
Taxonomy
Topics: Natural Language Processing Techniques
