Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices

Špela Vintar; Taja Kuzman Pungeršek; Mojca Brglez; Nikola Ljubešić

arXiv:2510.24450·cs.CL·November 5, 2025

TL;DR

This paper reviews recent LLM benchmarking developments, introduces a taxonomy tailored for multilingual and European languages, and proposes best practices to improve evaluation sensitivity to language and culture.

Contribution

It presents a new taxonomy for LLM benchmarks focused on multilingual and European languages and suggests best practices for more culturally sensitive evaluations.

Findings

01 Proposed a taxonomy for multilingual LLM benchmarks

02 Recommended best practices for evaluation standards

03 Highlighted the need for cultural sensitivity in assessments

Abstract

While new benchmarks for large language models (LLMs) are being developed continuously to keep pace with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for greater linguistic and cultural sensitivity in evaluation methods.
