But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

Leon Eshuijs; Archie Chaudhury; Alan McBeth; Ethan Nguyen

arXiv:2505.17760·cs.LG·April 2, 2026

TL;DR

This paper introduces JUSSA, a framework that uses a model's internal representations to generate contrastive honest alternatives, which help LLM judges detect dishonesty.

Contribution

JUSSA optimizes an honesty-promoting steering vector from a model's internal representations using a single training example. The steered outputs serve as contrastive references that help LLM judges detect subtle dishonesty.
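To make the idea of a steering vector concrete, here is a minimal sketch of generic activation steering: a fixed vector is scaled and added to a hidden state before generation continues. This is an illustration under stated assumptions, not the paper's procedure. The function name `apply_steering`, the scale `alpha`, and the toy two-dimensional vectors are all hypothetical, and the sketch does not show how JUSSA optimizes its vector.

```python
from typing import List


def apply_steering(hidden: List[float], steer: List[float], alpha: float = 1.0) -> List[float]:
    """Return the hidden state shifted by alpha * steer (generic activation steering).

    `hidden`, `steer`, and `alpha` are hypothetical names for this sketch;
    the paper's own optimization of the steering vector is not reproduced here.
    """
    if len(hidden) != len(steer):
        raise ValueError("hidden state and steering vector must have the same length")
    # Element-wise addition of the scaled steering direction.
    return [h + alpha * s for h, s in zip(hidden, steer)]


# Toy example: a 2-dimensional "hidden state" nudged along a steering direction.
steered = apply_steering([1.0, 2.0], [0.5, -1.0], alpha=2.0)
print(steered)
```

In a real model the same addition would be applied to a layer's activations at each token position; the steered response can then be shown to a judge alongside the original one.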

Findings

01. With GPT-4.1 as the judge, AUROC improved from 0.893 to 0.946.

02. With Claude Haiku as the judge, AUROC improved from 0.859 to 0.929.

03. Performance degrades when task complexity is mismatched to judge capability.
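For readers unfamiliar with the metric, the following sketch shows one standard way to compute AUROC from judge scores: the fraction of (dishonest, honest) pairs in which the dishonest response receives the higher score, with ties counted as half. The function name `auroc` and the toy scores and labels are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: probability that a positive outscores a negative.

    labels use 1 for the positive class (here: dishonest) and 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical judge scores (higher = judged more dishonest) and true labels.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auroc(toy_scores, toy_labels))  # 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported improvements should be read.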

Abstract

LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements across both GPT-4.1 (0.893 $\to$ 0.946) and Claude Haiku (0.859 $\to$ 0.929) judges, though performance degrades when task complexity is mismatched to judge capability, suggesting contrastive evaluation helps most when the task is challenging but…
