MLLM-as-a-Judge Exhibits Model Preference Bias
Shuitsu Koyama, Yuiga Wada, Daichi Yashima, and Komei Sugiura

TL;DR
This paper investigates model-specific preference bias in multimodal large language models (MLLMs) used as evaluators, finds self-preference bias and mutual preference bias within model families, and proposes a simple ensemble method to mitigate these biases.
Contribution
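The study introduces Philautia-Eval, which quantifies model-specific preference bias by separating preference tendencies from differences in generation quality; analyzes bias patterns across 12 MLLMs using 1.29M caption-score pairs; and proposes Pomms, an ensemble approach that reduces bias while preserving performance.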
Findings
Representative MLLMs tend to favor their own outputs.
Mutual preference bias exists within certain model families.
Pomms ensemble reduces bias without sacrificing performance.
Abstract
Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused…
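The idea of separating a judge's preference from generation quality can be illustrated with a generic two-way decomposition. The sketch below is not the Philautia-Eval formula, which this summary does not give. It assumes a hypothetical table of mean scores indexed by judge and generator, removes each judge's overall leniency and each generator's overall quality, and treats the remaining interaction term as a rough preference signal. The model names, the scores, and the plain averaging used as an ensemble stand-in (not Pomms itself) are all illustrative assumptions.

```python
from statistics import mean


def preference_residuals(scores):
    """Double-center a judge x generator table of mean scores.

    scores[j][g] is the mean score judge j assigns to text from generator g.
    The residual scores[j][g] - row_mean[j] - col_mean[g] + grand_mean removes
    judge leniency (row effect) and generator quality (column effect).
    A positive residual on the diagonal hints at self-preference.
    """
    judges = list(scores)
    gens = list(scores[judges[0]])
    grand = mean(scores[j][g] for j in judges for g in gens)
    row = {j: mean(scores[j][g] for g in gens) for j in judges}
    col = {g: mean(scores[j][g] for j in judges) for g in gens}
    return {
        j: {g: scores[j][g] - row[j] - col[g] + grand for g in gens}
        for j in judges
    }


def ensemble_scores(scores):
    """Average each generator's score over all judges (a stand-in ensemble)."""
    judges = list(scores)
    gens = list(scores[judges[0]])
    return {g: mean(scores[j][g] for j in judges) for g in gens}


# Hypothetical toy data: three models act as both judges and generators.
toy = {
    "A": {"A": 0.9, "B": 0.6, "C": 0.6},
    "B": {"A": 0.7, "B": 0.8, "C": 0.6},
    "C": {"A": 0.7, "B": 0.6, "C": 0.8},
}

res = preference_residuals(toy)
ens = ensemble_scores(toy)

for judge in toy:
    print(judge, "self-preference residual:", round(res[judge][judge], 3))
print("ensemble scores:", {g: round(s, 3) for g, s in ens.items()})
```

In this toy table every judge scores its own output above what its leniency and the generator's quality would predict, so all diagonal residuals are positive. Averaging across judges dilutes each judge's individual tilt, which is the intuition behind using an ensemble, though the actual Pomms procedure may differ.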
