Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson; Steinunn Rut Friðriksdóttir; Bjarki Ármannsson; Iris Edda Nowenstein; Steinþór Steingrímsson

arXiv:2603.16406·cs.CL·March 18, 2026

Open Access

TL;DR

This paper critically examines LLM evaluation methods for Icelandic, highlighting issues with synthetic and machine-translated benchmarks that compromise validity, and advocates for improved, verified evaluation practices in low-resource language settings.

Contribution

It identifies flaws in current Icelandic LLM benchmarks, especially those using unverified synthetic data, and emphasizes the need for better verification methods in low-resource language evaluation.

Findings

01

Synthetic and machine-translated benchmarks often contain flawed test examples.

02

Verified human-authored or translated benchmarks show more reliable results.

03

Current benchmarks may significantly skew LLM evaluation outcomes.

Abstract

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low/medium-resource languages. We show that benchmarks containing unverified synthetic or machine-translated data commonly include severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against using such methods without verification in low/medium-resource settings, since the translation quality can, at best, only be as good as machine translation (MT) quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling