Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Tjaša Arčon (1), Matej Klemen (1), Marko Robnik-Šikonja (1), Kaja Dobrovoljc (1, 2, 3) ((1) University of Ljubljana, Faculty of Computer and Information Science, Slovenia; (2) University of Ljubljana, Faculty of Arts, Slovenia; (3) Jožef Stefan Institute, Ljubljana, Slovenia)

arXiv:2602.02182 · cs.CL · February 13, 2026

Open Access

TL;DR

This paper evaluates the metalinguistic knowledge of large language models across 2,660 languages using the World Atlas of Language Structures (WALS). It finds that this knowledge is limited and that performance tracks data availability and digital presence.

Contribution

It introduces a multilingual benchmark based on WALS features to assess LLMs' explicit linguistic knowledge across diverse languages.

Findings

01. GPT-4o performs best but reaches only moderate accuracy (0.367)

02. Models perform above chance but below the majority-class baseline

03. Performance correlates with digital language resources
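
To make the two reference points in finding 02 concrete, here is a minimal sketch of how a chance baseline and a majority-class baseline could be computed for multiple-choice questions. The function names, the per-feature definition of the majority class, and the toy data are assumptions made for illustration; the paper's exact procedure is not given on this page.

```python
from collections import Counter


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts holds the number of answer options for each question.
    """
    return sum(1.0 / k for k in option_counts) / len(option_counts)


def majority_class_accuracy(gold_by_feature):
    """Accuracy of always answering each feature's most frequent gold value.

    Assumption: the majority class is taken per feature, not globally.
    """
    correct = 0
    total = 0
    for labels in gold_by_feature.values():
        # most_common(1) returns [(value, count)] for the top value
        correct += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return correct / total


# Hypothetical toy data: two features with 4 and 2 answer options.
toy_gold = {
    "feature_A": ["a", "a", "b", "a"],
    "feature_B": ["x", "y", "x", "x", "y", "x"],
}
toy_option_counts = [4] * 4 + [2] * 6

if __name__ == "__main__":
    print("chance:", chance_accuracy(toy_option_counts))
    print("majority:", majority_class_accuracy(toy_gold))
```

When answer values are unevenly distributed, always choosing the most frequent value scores well above random guessing, which is why a model can beat chance and still fall short of the majority-class baseline.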

Abstract

LLMs are routinely evaluated on language use, yet their explicit knowledge about linguistic structure remains poorly understood. Existing linguistic benchmarks focus on narrow phenomena, emphasize high-resource languages, and rarely test metalinguistic knowledge, that is, explicit reasoning about language structure. We present a multilingual evaluation of metalinguistic knowledge in LLMs, based on the World Atlas of Language Structures (WALS), which documents 192 linguistic features across 2,660 languages. We convert WALS features into natural-language multiple-choice questions and evaluate models across documented languages. Using accuracy and macro F1, and comparing to chance and majority-class baselines, we assess performance and analyse variation across linguistic domains and language-related factors. Results show limited metalinguistic knowledge: GPT-4o performs best but achieves moderate…
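
The pipeline described in the abstract (feature to multiple-choice question, then scoring with accuracy and macro F1) can be sketched roughly as follows. This is an illustrative sketch only: the question template, the function names, and the sample answers are hypothetical and are not taken from the paper.

```python
def build_question(language, feature_name, values):
    """Render one feature/language pair as a lettered multiple-choice prompt.

    The wording of the template is an assumption, not the authors' prompt.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Which value best describes the feature '{feature_name}' in {language}?"]
    for letter, value in zip(letters, values):
        lines.append(f"{letter}. {value}")
    return "\n".join(lines)


def accuracy(gold, pred):
    """Fraction of questions answered correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical feature values and model answers, for illustration only.
    print(build_question("Slovenian", "Order of Subject, Object and Verb", ["SOV", "SVO", "VSO"]))
    gold = ["A", "A", "B", "C"]
    pred = ["A", "B", "B", "A"]
    print("accuracy:", accuracy(gold, pred))
    print("macro F1:", macro_f1(gold, pred))
```

Macro F1 weights every answer class equally, so a model that only ever predicts frequent values is penalized on the rare ones even when its accuracy looks acceptable.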

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Multilingual Education and Policy