Lost in Translation: Do LVLM Judges Generalize Across Languages?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang

TL;DR
This paper introduces MM-JudgeBench, a large-scale multilingual benchmark for evaluating vision-language model judges across diverse languages, revealing significant cross-lingual performance variability and limitations of current reward models.
Contribution
The paper presents MM-JudgeBench, the first comprehensive multilingual benchmark for LVLM judges, and provides an extensive analysis of their cross-lingual robustness and limitations.
Findings
LVLM judges show substantial performance variance across languages.
Model size and architecture are poor predictors of multilingual robustness.
Even state-of-the-art judges behave inconsistently across languages.
Abstract
Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well they generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a…
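To make the idea of cross-lingual variability on pairwise preference instances concrete, the following is a minimal sketch of how per-language judge accuracy and its spread could be computed. It is not the paper's evaluation code: the record layout (language, judge choice, human-preferred choice), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import pstdev

def per_language_accuracy(records):
    """Fraction of pairwise instances, per language, where the judge's
    choice matches the human-preferred response.

    `records` is a hypothetical layout: (language, judge_choice, human_choice).
    """
    hits = defaultdict(list)
    for language, judge_choice, human_choice in records:
        hits[language].append(judge_choice == human_choice)
    return {lang: sum(v) / len(v) for lang, v in hits.items()}

def accuracy_spread(accuracy):
    """Two simple summaries of cross-lingual variability:
    the best-minus-worst gap and the population standard deviation."""
    values = list(accuracy.values())
    return max(values) - min(values), pstdev(values)

# Toy data only; these are not results from the benchmark.
records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "A", "B"), ("en", "B", "B"),
    ("bn", "A", "A"), ("bn", "B", "A"), ("bn", "A", "B"), ("bn", "B", "B"),
]

accuracy = per_language_accuracy(records)
gap, std = accuracy_spread(accuracy)
print(accuracy, gap, std)
```

A large gap or standard deviation under a summary like this would correspond to the kind of inconsistency across languages that the findings describe.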
