MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

Jaeyun Lee; Junyoung Koh; Zeynel Tok; Hunar Batra; Ronald Clark

arXiv:2605.03858·cs.CL·May 6, 2026

TL;DR

MCJudgeBench is a benchmark that evaluates how well LLM judges assess each individual constraint in a multi-constraint response. It shows that judge reliability has multiple dimensions and underscores the value of constraint-level evaluation.

Contribution

The paper introduces a benchmark with per-constraint gold labels, controlled response-side perturbations, evaluation prompt variants, and correctness and inconsistency metrics for analyzing judge reliability in multi-constraint instruction following.

Findings

01

Judge reliability varies across label categories, especially for the "partial" and "no" labels.

02

Higher correctness does not always mean lower inconsistency.

03

Reasoning-based evaluation improves correctness but not stability.

Abstract

Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions:…
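
To make the instance format and the two metric families concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `Instance` fields mirror the components listed in the abstract, but the names `per_label_accuracy` and `disagreement_rate`, the toy data, and the exact metric definitions are hypothetical, since the abstract does not give the formulas.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

LABELS = ("yes", "partial", "no")  # label set stated in the abstract


@dataclass
class Instance:
    """One benchmark item; field names are assumptions, not the paper's schema."""
    instruction: str
    response: str
    constraints: List[str]
    gold: List[str]  # one gold label per constraint


def per_label_accuracy(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Accuracy broken down by gold label (an assumed correctness metric)."""
    total = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {lab: hits[lab] / total[lab] for lab in LABELS if total[lab]}


def disagreement_rate(runs: List[List[str]]) -> float:
    """Fraction of constraints whose label differs across runs
    (an assumed, simplified inconsistency metric)."""
    n = len(runs[0])
    flipped = sum(1 for i in range(n) if len({run[i] for run in runs}) > 1)
    return flipped / n


# Toy instance with made-up content.
item = Instance(
    instruction="Write a haiku about rain in English, without the word 'wet'.",
    response="(candidate response text)",
    constraints=["is a haiku", "about rain", "in English", "avoids 'wet'"],
    gold=["yes", "partial", "no", "yes"],
)

# Intrinsic inconsistency: the same prompt sampled several times.
sampled_runs = [
    ["yes", "yes", "no", "yes"],
    ["yes", "partial", "no", "yes"],
    ["yes", "yes", "no", "yes"],
]

# Procedural inconsistency: one judgment per evaluation prompt variant.
prompt_variant_runs = [
    ["yes", "yes", "no", "yes"],
    ["yes", "yes", "partial", "no"],
]

accuracy = per_label_accuracy(item.gold, sampled_runs[0])
intrinsic = disagreement_rate(sampled_runs)
procedural = disagreement_rate(prompt_variant_runs)
```

Reporting accuracy per gold label rather than as a single number is what lets weaknesses on specific labels show up. The sketch covers only prompt variants for procedural inconsistency. Response-side perturbations may change the gold labels themselves, so comparing judgments across them would need the perturbed gold labels as well.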
