MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Guijin Son; Dongkeun Yoon; Juyoung Suk; Javier Aula-Blasco; Mano Aslan; Vu Trong Kim; Shayekh Bin Islam; Jaume Prats-Cristià; Lucía Tormo-Bañuelos; Seungone Kim

arXiv:2410.17578·cs.CL·April 1, 2025·2 cites

TL;DR

MM-Eval is a comprehensive multilingual meta-evaluation benchmark designed to assess the reliability, consistency, and fairness of LLM-based evaluators across diverse languages, addressing limitations of English-centric benchmarks.

Contribution

The paper introduces MM-Eval, a multilingual meta-evaluation benchmark that tests LLM evaluators on non-English text. It comprises five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages.

Findings

01. Existing evaluators perform poorly on non-English outputs.

02. Evaluators show unfairness and inconsistency for low-resource languages.

03. MM-Eval correlates strongly with human rankings, outperforming previous benchmarks.

Abstract

As Large Language Models (LLMs) are now capable of producing fluent and coherent content in languages other than English, it is now imperative to precisely evaluate these non-English outputs. However, when assessing the outputs of multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without thoroughly examining whether these evaluators could effectively assess non-English text as well. Moreover, existing benchmarks for testing evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine whether evaluator LLMs can reliably assess the outputs of multilingual LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages. A core attribute of MM-Eval is that,…
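To make the idea of meta-evaluation concrete, here is a minimal sketch of how an evaluator could be scored per language. It assumes a pairwise format in which each item holds a preferred ("chosen") and a dispreferred ("rejected") response; that format, the `PairExample` type, the `accuracy_by_language` helper, and the toy `length_judge` are all hypothetical illustrations and are not taken from the MM-Eval release.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class PairExample(NamedTuple):
    """One hypothetical meta-evaluation item (assumed pairwise format)."""
    language: str
    prompt: str
    chosen: str    # response labelled as better
    rejected: str  # response labelled as worse


# A judge takes (prompt, response_a, response_b) and returns 0 if it
# prefers response_a, 1 if it prefers response_b.
Judge = Callable[[str, str, str], int]


def accuracy_by_language(examples: List[PairExample], judge: Judge) -> Dict[str, float]:
    """Fraction of items per language where the judge picks the chosen response."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        # The chosen response is always shown first here for brevity; a real
        # harness would also swap the order to control for position bias.
        pick = judge(ex.prompt, ex.chosen, ex.rejected)
        total[ex.language] += 1
        correct[ex.language] += int(pick == 0)
    return {lang: correct[lang] / total[lang] for lang in total}


def length_judge(prompt: str, a: str, b: str) -> int:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    return 0 if len(a) >= len(b) else 1
```

Comparing the resulting per-language accuracies is one simple way to see whether a judge that does well on English degrades on other languages.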

Code & Models

Repositories

guijinSON/MM-Eval (PyTorch, Official)

Taxonomy

Topics: Artificial Intelligence in Law · Law, Economics, and Judicial Systems