Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

Andreas Waldis; Yotam Perlitz; Leshem Choshen; Yufang Hou; Iryna Gurevych

arXiv:2404.18923·cs.CL·May 12, 2026


TL;DR

Holmes is a benchmark that uses classifier-based probing to evaluate language models' unconscious understanding of linguistic phenomena across syntax, morphology, semantics, reasoning, and discourse.

Contribution

The paper introduces Holmes, a benchmark of more than 200 datasets with accompanying analysis, designed to disentangle linguistic competence from other cognitive abilities such as instruction following.

Findings

01

Model size correlates with linguistic competence.

02

Architecture and instruction tuning significantly affect performance.

03

FlashHolmes reduces computation while maintaining accuracy.

Abstract

We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs), that is, their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. To compose Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in…
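To make "classifier-based probing" concrete, here is a minimal sketch of the general idea: a small linear classifier is trained on frozen representations, and its held-out accuracy is read as evidence of how well those representations encode a given property. This is not the Holmes implementation. The function names (`make_synthetic_representations`, `train_linear_probe`, `probe_accuracy`) are hypothetical, and the synthetic vectors are an assumption standing in for real LM hidden states.

```python
import math
import random


def make_synthetic_representations(n, dim, seed):
    """Create stand-in 'frozen representations' (assumption: not real LM states).

    Dimension 0 carries a noisy binary signal; all other dimensions are noise.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] = (1.0 if y == 1 else -1.0) + rng.gauss(0.0, 0.3)
        features.append(x)
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with batch gradient descent.

    Only the probe's weights are updated; the input vectors stay fixed.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Return the fraction of examples the probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_synthetic_representations(200, 8, seed=0)
    test_x, test_y = make_synthetic_representations(100, 8, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

In a real probing setup, the feature vectors would be extracted from a fixed layer of the LM for each token or span, and the labels would come from an annotated linguistic dataset.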
