MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim

TL;DR
MM-Eval is a comprehensive multilingual meta-evaluation benchmark designed to assess the reliability, consistency, and fairness of LLM-based evaluators across diverse languages, addressing limitations of English-centric benchmarks.
Contribution
The paper introduces MM-Eval, a multilingual meta-evaluation benchmark for LLM evaluators. It comprises five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages, with a focus on non-English assessment.
Findings
Existing evaluators perform poorly on non-English outputs.
Evaluators show unfairness and inconsistency for low-resource languages.
MM-Eval correlates strongly with human rankings, outperforming previous benchmarks.
Abstract
As Large Language Models (LLMs) are now capable of producing fluent and coherent content in languages other than English, it has become imperative to precisely evaluate these non-English outputs. However, when assessing the outputs from multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without a thorough examination of whether these evaluators could effectively assess non-English text as well. Moreover, existing benchmarks to test evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine whether evaluator LLMs can reliably assess the outputs of multilingual LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages. A core attribute of MM-Eval is that,…
Taxonomy
Topics: Artificial Intelligence in Law · Law, Economics, and Judicial Systems
Methods: Focus
