JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Rohith Reddy Bellibatlu; Edward Raff; Wenbin Zhang

arXiv:2604.23478 · cs.CL · May 11, 2026

TL;DR

This paper introduces JudgeSense, a benchmark for evaluating prompt sensitivity in LLM-based judges, revealing that model size does not guarantee decision stability across paraphrased prompts.

Contribution

The paper presents JudgeSense, a comprehensive benchmark with hand-validated prompt paraphrases to systematically assess judge stability across tasks and models.

Findings

01  Coherence is the most distinguishing task for judge behavior.

02  Factuality judgments are generally stable under standard conditions.

03  Model scale does not reliably predict decision consistency.

Abstract

Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a systematic empirical study of prompt-induced decision instability across multiple evaluation tasks and judge architectures. To facilitate this analysis, we release JudgeSense, a benchmark comprising hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, drawn from established NLP benchmarks and accompanied by comprehensive decision logs. The benchmark enables the measurement of judge stability across equivalent prompts, allowing researchers to assess whether stability correlates with model scale or instruction-tuning, and to identify which tasks are most sensitive to prompt wording. Our evaluation reveals that coherence remains the primary task for…
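The abstract describes measuring whether a judge returns the same verdict for two semantically equivalent prompts. A minimal sketch of one way such a measurement could be computed is given here, as a per-task flip rate over paraphrase pairs. The record layout, the function name `flip_rate_by_task`, and the toy log are all hypothetical illustrations; they are not taken from the paper or from the released dataset, and the paper's own stability metric may be defined differently.

```python
from collections import defaultdict


def flip_rate_by_task(records):
    """Return, per task, the fraction of paraphrase pairs whose verdicts differ.

    `records` is an iterable of (task, verdict_a, verdict_b) tuples, where
    verdict_a and verdict_b are the judge's decisions for the same item under
    two paraphrased prompts. This layout is an assumption for illustration.
    """
    totals = defaultdict(int)
    flips = defaultdict(int)
    for task, verdict_a, verdict_b in records:
        totals[task] += 1
        if verdict_a != verdict_b:
            flips[task] += 1
    return {task: flips[task] / totals[task] for task in totals}


# Toy, made-up decision log (not real benchmark data).
toy_log = [
    ("factuality", "yes", "yes"),
    ("factuality", "no", "no"),
    ("coherence", "yes", "no"),
    ("coherence", "yes", "yes"),
]

rates = flip_rate_by_task(toy_log)
print(rates)
```

A lower flip rate means the judge is more stable on that task. Comparing these rates across judges of different sizes is one way to ask whether scale tracks stability, under the assumptions stated above.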

Code & Models

Datasets

Rohithreddybc/judgesense-benchmark
dataset · 186 dl
