Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting

Josh McGiff; Khanh-Tung Tran; William Mulcahy; Dáibhidh Ó Luinín; Jake Dalzell; Róisín Ní Bhroin; Adam Burke; Barry O'Sullivan; Hoang D. Nguyen; Nikola S. Nikolov

arXiv:2510.20957·cs.CL·October 27, 2025

TL;DR

Irish-BLiMP is a benchmark of 1020 minimal pairs for evaluating linguistic competence in Irish. It compares fluent human speakers with language models across 11 linguistic features and finds that every model tested falls short of human accuracy.

Contribution

This work introduces the first dataset and framework for fine-grained evaluation of grammatical competence in Irish, a low-resource and endangered language, applied to both fluent human speakers and language models.

Findings

01

Humans outperform all models on every linguistic feature, with 16.6% higher accuracy on average.

02

An 18.1% performance gap exists between open- and closed-source LLMs.

03

Even the strongest model (gpt-5) reaches only 73.5% accuracy, compared with 90.1% for humans.

Abstract

We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the first dataset and framework designed for fine-grained evaluation of linguistic competence in Irish, an endangered language. Drawing on a variety of linguistic literature and grammar reference works, a team of fluent Irish speakers manually constructed and reviewed 1020 minimal pairs across a taxonomy of 11 linguistic features. We evaluate both existing Large Language Models (LLMs) and fluent human participants on their syntactic knowledge of Irish. Our findings show that humans outperform all models across all linguistic features, achieving 16.6% higher accuracy on average. Moreover, a substantial performance gap of 18.1% persists between open- and closed-source LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy compared to 90.1% for humans. Interestingly, human…
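To make the minimal-pair setup concrete, the sketch below shows one common way such a benchmark can be scored: a pair counts as correct when the evaluated system prefers the grammatical sentence over the ungrammatical one, and accuracy is reported per linguistic feature. This is an illustration under stated assumptions, not the authors' evaluation code. The excerpt above does not say how model preferences were elicited, so the `score` callable, the `MinimalPair` type, the feature names, and the toy scores are all hypothetical placeholders.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class MinimalPair(NamedTuple):
    """One benchmark item (hypothetical layout, not the dataset's real schema)."""
    feature: str        # linguistic feature the pair targets
    grammatical: str    # acceptable sentence
    ungrammatical: str  # minimally different, unacceptable sentence


def minimal_pair_accuracy(
    pairs: Iterable[MinimalPair],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Return per-feature accuracy.

    A pair is correct when `score` rates the grammatical sentence strictly
    higher than the ungrammatical one. `score` is an assumption: it could be
    a model log-probability or any other preference signal.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.feature] += 1
        if score(pair.grammatical) > score(pair.ungrammatical):
            correct[pair.feature] += 1
    return {feature: correct[feature] / total[feature] for feature in total}


# Toy data: placeholder strings stand in for real sentences.
pairs = [
    MinimalPair("feature_a", "a1-good", "a1-bad"),
    MinimalPair("feature_a", "a2-good", "a2-bad"),
    MinimalPair("feature_b", "b1-good", "b1-bad"),
]

# Made-up scores standing in for a real model; higher means "preferred".
toy_scores = {
    "a1-good": -1.0, "a1-bad": -2.0,   # correct preference
    "a2-good": -3.0, "a2-bad": -2.5,   # wrong preference
    "b1-good": -1.0, "b1-bad": -4.0,   # correct preference
}

result = minimal_pair_accuracy(pairs, toy_scores.__getitem__)
print(result)
```

With the toy scores, one of the two `feature_a` pairs and the single `feature_b` pair are judged correctly. Averaging such per-feature accuracies is one way headline figures like those quoted above could be aggregated, though the exact aggregation used in the paper is not stated in this excerpt.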
