GaelEval: Benchmarking LLM Performance for Scottish Gaelic
Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne

TL;DR
GaelEval is a comprehensive benchmark for assessing multilingual LLMs' performance on Scottish Gaelic, revealing models' strengths and weaknesses across linguistic, translation, and cultural tasks.
Contribution
This paper introduces GaelEval, the first multi-dimensional Gaelic benchmark, and provides extensive evaluation of 19 LLMs, highlighting performance gaps and prompting effects.
Findings
Gemini 3 Pro Preview achieves 83.3% accuracy on linguistic tasks.
Proprietary models outperform open-weight systems.
Gaelic prompting provides a small but consistent performance advantage.
Abstract
Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline, we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline. Proprietary models consistently outperform open-weight systems, and in-language (Gaelic)…
