Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang

TL;DR
This paper introduces M-JudgeBench, a comprehensive benchmark for evaluating multimodal large language models as judges, and proposes Judge-MCTS, a data generation framework to improve judge model reliability and capabilities.
Contribution
It presents a new ten-dimensional, capability-oriented benchmark and an MCTS-driven data construction method to enhance the evaluation and training of MLLM-based judge models.
Findings
M-JudgeBench reveals weaknesses in existing judge models.
Judge-MCTS improves judge model performance and reliability.
M-JudgeBench and Judge-MCTS outperform previous benchmarks and methods.
Abstract
Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task type but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and…
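To make the capability-oriented scoring idea concrete, here is a minimal sketch of how per-dimension pairwise accuracy could be computed for a judge. It is an illustration under stated assumptions, not the paper's actual data format or metric: the `PairSample` fields, the dimension labels, and the toy `longer_is_better` judge are all hypothetical, and real samples would also carry an image.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PairSample:
    # All field names are hypothetical; the paper's schema is not shown here.
    dimension: str   # assumed subtask label, e.g. "length_bias"
    question: str
    response_a: str
    response_b: str
    preferred: str   # ground-truth label: "A" or "B"


Judge = Callable[[str, str, str], str]  # returns "A" or "B"


def accuracy_by_dimension(samples: List[PairSample], judge: Judge) -> Dict[str, float]:
    """Fraction of pairs per dimension where the judge picks the preferred response."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for s in samples:
        verdict = judge(s.question, s.response_a, s.response_b)
        total[s.dimension] += 1
        correct[s.dimension] += int(verdict == s.preferred)
    return {d: correct[d] / total[d] for d in total}


def longer_is_better(question: str, a: str, b: str) -> str:
    """Toy judge with a deliberate length bias: always prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Tiny made-up dataset, for illustration only.
samples = [
    PairSample("length_bias", "2+2?", "4", "The answer, after long deliberation, is 5.", "A"),
    PairSample("length_bias", "3+3?", "It is 6 because 3+3=6.", "7", "A"),
    PairSample("process_error", "5*2?", "5*2=10", "5*2=5+2=7, a much longer answer", "A"),
]

scores = accuracy_by_dimension(samples, longer_is_better)
print(scores)
```

Reporting accuracy per dimension rather than as one aggregate makes a failure mode such as length bias visible even when overall accuracy looks acceptable.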
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
