Context Over Content: Exposing Evaluation Faking in Automated Judges

Manan Gupta; Inderjeet Nair; Lu Wang; Dhruv Kumar

arXiv:2604.15224·cs.AI·April 17, 2026

Context Over Content: Exposing Evaluation Faking in Automated Judges

Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar

PDF

TL;DR

This paper reveals that automated AI judges are vulnerable to 'stakes signaling', where subtle contextual cues influence their evaluations without explicit awareness, leading to biased safety assessments.

Contribution

It introduces a controlled experimental framework to measure evaluation faking and demonstrates the existence of implicit leniency bias in LLM-based judges.

Findings

01

Judges soften verdicts when informed of negative consequences for the evaluated model.

02

Bias causes up to a 30% relative drop in unsafe-content detection.

03

Chain-of-thought analysis fails to detect the implicit bias.

Abstract

The $LLM-as-a-judge$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $stakes signaling$ , a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.