TL;DR
This paper introduces MM-JudgeBias, a benchmark to evaluate compositional biases in multimodal large language models acting as evaluators, revealing systematic modality neglect and evaluation asymmetries.
Contribution
It systematically defines compositional bias in MLLM-as-a-Judge and provides a comprehensive benchmark with metrics and a dataset for diagnosis.
Findings
26 state-of-the-art MLLMs show modality neglect.
Two metrics, Bias-Deviation (BD, for sensitivity) and Bias-Conformity (BC, for stability), expose both sensitivity and stability problems in these judges.
Over 1,800 curated samples enable detailed bias analysis.
Abstract
Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source…
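The perturbation idea in the abstract can be illustrated with a small sketch. The abstract as shown here is truncated and does not give the BD or BC formulas, so nothing below reproduces them: `Sample`, `drop_image`, `pad_query`, `changed_fraction`, and both toy judges are hypothetical names invented for illustration, and the "fraction of samples whose score moved" statistic is only an assumed stand-in for the paper's metrics.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Sample:
    """Hypothetical (query, image, response) triple handed to a judge."""
    query: str
    image: Optional[str]  # stand-in for image content, e.g. a caption
    response: str


Judge = Callable[[Sample], float]
Perturbation = Callable[[Sample], Sample]


def drop_image(sample: Sample) -> Sample:
    """Evidence-removing perturbation: a reliable judge should react."""
    return replace(sample, image=None)


def pad_query(sample: Sample) -> Sample:
    """Semantically irrelevant perturbation: a reliable judge should not react."""
    return replace(sample, query=sample.query + "  ")


def changed_fraction(judge: Judge, samples: Sequence[Sample],
                     perturb: Perturbation, tol: float = 0.0) -> float:
    """Fraction of samples whose score moves by more than tol after perturbing.

    Assumed illustrative statistic, not the paper's BD or BC definition.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    moved = sum(abs(judge(s) - judge(perturb(s))) > tol for s in samples)
    return moved / len(samples)


def text_only_judge(sample: Sample) -> float:
    """Toy judge that ignores the image entirely (modality neglect)."""
    return float(len(sample.response.split()))


def grounded_judge(sample: Sample) -> float:
    """Toy judge that refuses to award credit when the image is missing."""
    if sample.image is None:
        return 0.0
    return float(len(sample.response.split()))


SAMPLES = [
    Sample("What color is the car?", "photo of a red car", "The car is red."),
    Sample("How many dogs are there?", "photo of two dogs", "Two dogs."),
]

if __name__ == "__main__":
    for name, judge in [("text_only", text_only_judge), ("grounded", grounded_judge)]:
        reacts = changed_fraction(judge, SAMPLES, drop_image)
        wobbles = changed_fraction(judge, SAMPLES, pad_query)
        print(f"{name}: reacts to missing image={reacts}, moved by padding={wobbles}")
```

In this toy setup the text-only judge never changes its score when the image is removed, which is the kind of modality neglect the summary describes, while neither judge is disturbed by whitespace padding in the query.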
