Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar

TL;DR
This paper reveals that automated AI judges are vulnerable to 'stakes signaling', where subtle contextual cues influence their evaluations without explicit awareness, leading to biased safety assessments.
Contribution
It introduces a controlled experimental framework to measure evaluation faking and demonstrates the existence of implicit leniency bias in LLM-based judges.
Findings
Judges soften verdicts when informed of negative consequences for the evaluated model.
Bias causes up to a 30% relative drop in unsafe-content detection.
Chain-of-thought analysis fails to detect the implicit bias.
Abstract
The paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate , a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
