How Reliable is Multilingual LLM-as-a-Judge?
Xiyan Fu, Wei Liu

TL;DR
This paper evaluates the reliability of LLMs as judges of generated content in multilingual settings, covering five models, five tasks, and 25 languages. It finds that judgments are often inconsistent across languages, especially low-resource ones, and proposes an ensemble approach to improve judgment consistency.
Contribution
It provides a comprehensive analysis of multilingual LLMs as evaluators, highlighting their limitations and introducing an ensemble strategy to enhance evaluation reliability.
Findings
LLMs show low consistency in multilingual judgment (average Fleiss' Kappa ~0.3).
Consistency varies significantly across languages and is particularly poor in low-resource ones.
Ensemble strategies can improve judgment consistency in multilingual evaluation.
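The summary above does not say how the ensemble combines individual judgments. As a minimal sketch, assuming the simplest option (majority voting over the verdicts of several judge models), the idea could look like the following. The function name `majority_vote`, the verdict labels, and the `"tie"` fallback are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def majority_vote(verdicts):
    """Combine verdicts from several judges into one label.

    `verdicts` is a list of labels, one per judge model (hypothetical
    labels such as "A", "B"). Returns the most frequent label, or the
    string "tie" when no single label wins outright.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else "tie"


# Three hypothetical judges evaluate the same response pair.
ensemble_verdict = majority_vote(["A", "A", "B"])
```

Ties are surfaced explicitly rather than broken arbitrarily, so a caller can decide whether to discard the item or consult another judge.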
Abstract
LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource…
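For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Fleiss' Kappa in plain Python. It assumes that each language version of an item is treated as one "rater" and that every item receives the same number of judgments; that framing, the function name `fleiss_kappa`, and the toy labels are assumptions for illustration, not details confirmed by the abstract.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a fixed number of raters per item.

    ratings:    list of items; each item is a list of labels, one per rater
                (here, hypothetically, one per language).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who gave item i the label categories[j]
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall label proportions.
    total = n_items * n_raters
    p_labels = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_labels)

    if abs(1.0 - p_e) < 1e-12:
        # Only one label was ever used, so Kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data: two items, each judged in three hypothetical languages.
kappa = fleiss_kappa([["A", "A", "B"], ["A", "B", "B"]], ["A", "B"])
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.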
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
