MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Jaeyun Lee, Junyoung Koh, Zeynel Tok, Hunar Batra, Ronald Clark

TL;DR
MCJudgeBench is a benchmark for evaluating how well LLM judges assess responses to multi-constraint instructions at the level of individual constraints. Its results show that judge reliability has multiple dimensions, which motivates constraint-level evaluation.
Contribution
The paper introduces a benchmark with per-constraint gold labels, controlled response perturbations, evaluation prompt variants, and correctness and inconsistency metrics for analyzing judge reliability in multi-constraint instruction following.
Findings
Judge reliability varies across label categories, most visibly on the partial and no labels.
Higher correctness does not always mean lower inconsistency.
Reasoning-based evaluation improves correctness but not stability.
Abstract
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions:…
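As a rough illustration of the setup the abstract describes, the sketch below models one benchmark instance and two simple metrics: per-label correctness and a disagreement rate over repeated judge runs. The names `Instance`, `per_label_accuracy`, and `disagreement_rate`, the field layout, and both metric definitions are assumptions made for illustration; the paper's actual schema and metric formulas are not given in this summary and may differ. The example labels are toy data, not results from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

LABELS = ("yes", "partial", "no")  # label set stated in the abstract


@dataclass
class Instance:
    """Hypothetical container for one benchmark instance."""
    instruction: str
    response: str
    constraints: List[str]
    gold: List[str]  # one gold label per constraint


def per_label_accuracy(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Accuracy computed separately for each gold label category (assumed metric)."""
    hits, totals = Counter(), Counter()
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return {lab: hits[lab] / totals[lab] for lab in LABELS if totals[lab]}


def disagreement_rate(runs: List[List[str]]) -> float:
    """Fraction of constraints on which the runs do not all agree (assumed metric)."""
    n = len(runs[0])
    unstable = sum(1 for i in range(n) if len({run[i] for run in runs}) > 1)
    return unstable / n


# Toy instance with four constraints.
example = Instance(
    instruction="Write a haiku about rain, in English, without the word 'wet', and add a title.",
    response="(candidate response text)",
    constraints=["is a haiku", "is about rain", "avoids 'wet'", "has a title"],
    gold=["yes", "partial", "no", "yes"],
)

# Three judge runs on the same input, e.g. sampled with stochastic decoding.
runs = [
    ["yes", "yes", "no", "yes"],
    ["yes", "partial", "no", "yes"],
    ["yes", "yes", "no", "yes"],
]

accuracy = per_label_accuracy(example.gold, runs[0])
instability = disagreement_rate(runs)
print(accuracy)     # {'yes': 1.0, 'partial': 0.0, 'no': 1.0}
print(instability)  # 0.25
```

In this toy case the first run is fully correct on yes and no but misses the partial constraint, and the runs disagree on one of four constraints. Under the same assumptions, feeding `disagreement_rate` with runs collected under different evaluation prompt variants instead of repeated samples would give a procedural rather than intrinsic reading.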
