JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
Rohith Reddy Bellibatlu, Edward Raff, Wenbin Zhang

TL;DR
This paper introduces JudgeSense, a benchmark for evaluating prompt sensitivity in LLM-based judges, revealing that model size does not guarantee decision stability across paraphrased prompts.
Contribution
The paper presents JudgeSense, a benchmark of hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, designed to measure how stable a judge's decisions are across tasks and models.
Findings
Coherence is the task that most clearly distinguishes one judge's behavior from another's.
Factuality judgments are generally stable under standard conditions.
Model scale does not reliably predict decision consistency.
Abstract
Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a systematic empirical study of prompt-induced decision instability across multiple evaluation tasks and judge architectures. To facilitate this analysis, we release JudgeSense, a benchmark comprising hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, drawn from established NLP benchmarks and accompanied by comprehensive decision logs. The benchmark enables the measurement of judge stability across equivalent prompts, allowing researchers to assess whether stability correlates with model scale or instruction-tuning, and to identify which tasks are most sensitive to prompt wording. Our evaluation reveals that coherence remains the primary task for…
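One way to make "judge stability across equivalent prompts" concrete is a per-task agreement rate: the fraction of paraphrase pairs on which a judge returns the same decision for both wordings. The sketch below is an illustration only. The record layout, the function name `consistency_by_task`, and the toy decisions are assumptions made for this example; they are not the paper's metric, data format, or results.

```python
from collections import defaultdict


def consistency_by_task(records):
    """Return, per task, the share of paraphrase pairs with identical decisions.

    `records` is an iterable of (task, decision_a, decision_b) tuples, where
    decision_a and decision_b are the judge's outputs for two semantically
    equivalent prompts on the same item. This layout is hypothetical.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for task, decision_a, decision_b in records:
        total[task] += 1
        if decision_a == decision_b:
            agree[task] += 1
    return {task: agree[task] / total[task] for task in total}


# Made-up decisions, used only to show the call shape.
toy_records = [
    ("factuality", "yes", "yes"),
    ("factuality", "no", "no"),
    ("coherence", "yes", "no"),
    ("coherence", "yes", "yes"),
]
rates = consistency_by_task(toy_records)
```

Computing the rate separately for each task and each judge would allow the comparisons the abstract describes, such as checking whether larger models score higher or which tasks are most sensitive to wording.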
