Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong; Yi Ge; Ming Li; Zuolong Zhang; Pranav Kulkarni; Kaishen Wang; Qi He; Zeying Zhu; Chenxi Liu; Ruibo Chen; Tong Zheng; Yanshuo Chen; Xiyao Wang; Renrui Zhang; Wenhu Chen; Heng Huang

arXiv:2511.21662 · cs.CV · March 16, 2026

Open Access

TL;DR

Multi-Crit is a benchmark that tests whether multimodal judge models can follow diverse, fine-grained evaluation criteria. An analysis of 25 LMMs exposes their current limitations in criterion-level judging.

Contribution

The paper develops a benchmark and three new metrics that assess pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts in multimodal judges, an aspect of judge evaluation that remains underexplored.

Findings

01 Proprietary models struggle with consistent criterion adherence.

02 Open-source models lag in flexible criterion following.

03 Fine-tuning improves visual grounding but not pluralistic judgment.

Abstract

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1)…
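
A minimal sketch of how criterion-level judgments might be scored, assuming a strict reading of pluralistic adherence in which a response pair counts as correct only when the judge matches the human preference on every annotated criterion. The function names, the data layout, the criterion names, and the scoring rules are all illustrative assumptions, not the paper's definitions.

```python
from typing import Dict, List

# Hypothetical layout: one dict per response pair, mapping a criterion name
# to the preferred response ("A" or "B") under that criterion.
Labels = Dict[str, str]


def criterion_accuracy(human: List[Labels], judge: List[Labels]) -> float:
    """Fraction of individual (pair, criterion) judgments that match the human label."""
    total = 0
    correct = 0
    for h, j in zip(human, judge):
        for criterion, preferred in h.items():
            total += 1
            correct += int(j.get(criterion) == preferred)
    return correct / total if total else 0.0


def pluralistic_accuracy(human: List[Labels], judge: List[Labels]) -> float:
    """Fraction of pairs where the judge matches the human label on every criterion.

    This all-or-nothing rule is an assumed interpretation of pluralistic adherence.
    """
    if not human:
        return 0.0
    hits = sum(
        all(j.get(criterion) == preferred for criterion, preferred in h.items())
        for h, j in zip(human, judge)
    )
    return hits / len(human)


def has_conflict(labels: Labels) -> bool:
    """True when the criteria disagree about which response is better."""
    return len(set(labels.values())) > 1


# Toy data with made-up criterion names.
human = [
    {"accuracy": "A", "clarity": "B"},  # criteria conflict for this pair
    {"accuracy": "A", "clarity": "A"},
]
judge = [
    {"accuracy": "A", "clarity": "A"},  # misses the conflict on the first pair
    {"accuracy": "A", "clarity": "A"},
]

per_criterion = criterion_accuracy(human, judge)
per_pair = pluralistic_accuracy(human, judge)
```

In this toy example the judge gets three of four individual criterion judgments right but only one of two pairs fully right, which shows how a per-pair score can expose failures that per-criterion accuracy hides.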

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning