How Reliable is Multilingual LLM-as-a-Judge?

Xiyan Fu; Wei Liu

arXiv:2505.12201·cs.CL·May 20, 2025

Open Access

TL;DR

This paper evaluates how reliably LLMs act as judges of generated content across diverse languages. It finds significant inconsistencies, especially in low-resource languages, and proposes an ensemble approach to improve judgment consistency.

Contribution

The paper provides a comprehensive analysis of LLMs as multilingual evaluators (five models from different families, five tasks, 25 languages), highlights their limitations, and introduces an ensemble strategy to make evaluation more reliable.
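This listing does not say how the ensemble is built. As a rough illustration only, the sketch below assumes the simplest possible variant: several judge models each return a verdict for the same item, and the final verdict is the majority label. The function name `majority_vote`, the `"tie"` fallback, and the verdict labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Combine verdicts from several judges into one label.

    Assumption (not from the paper): plain majority voting, with the
    string "tie" returned when no single label has the most votes.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    best = max(counts.values())
    # Collect every label that reached the top vote count.
    winners = [label for label, c in counts.items() if c == best]
    return winners[0] if len(winners) == 1 else "tie"


if __name__ == "__main__":
    # Three hypothetical judges compare response A against response B.
    print(majority_vote(["A", "A", "B"]))
```

A weighted vote or a different tie-breaking rule would be an equally reasonable assumption; the point is only that an ensemble replaces a single judge's verdict with an aggregate.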

Findings

01

LLMs show low judgment consistency across languages (average Fleiss' Kappa of approximately 0.3).

02

Consistency varies significantly across languages and is particularly poor in low-resource ones.

03

Ensemble strategies can improve judgment consistency in multilingual evaluation.
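For readers unfamiliar with the consistency metric cited in the findings, Fleiss' Kappa is defined as κ = (P̄ − P_e) / (1 − P_e), where P̄ is the mean per-item agreement among raters and P_e is the agreement expected by chance. The sketch below is a minimal pure-Python version. It assumes one plausible setup, in which each language version of the same item counts as one "rater"; the exact protocol is not described in this listing, and the function name and labels are illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each a list of rater labels.

    Every item must be rated by the same number of raters (at least 2).
    Assumption for this sketch: a "rater" is the judge's verdict for
    one language version of the item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Per-item agreement P_i, averaged into P-bar.
    p_bar = sum(
        (sum(v * v for v in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Chance agreement P_e from the overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Two items, three "raters" each (for example, three languages).
    print(fleiss_kappa([["A", "A", "A"], ["B", "B", "B"]]))
```

On the commonly used Landis and Koch scale, values between 0.21 and 0.40 are read as only "fair" agreement, which is why an average of about 0.3 is described as low consistency.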

Abstract

LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling