But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen

TL;DR
This paper introduces JUSSA, a framework that uses internal model representations to generate contrastive honest alternatives, improving LLM-based judges' detection of dishonesty.
Contribution
JUSSA leverages internal representations to create honesty-promoting steering vectors, enhancing the ability of LLM judges to detect subtle dishonesty through contrastive evaluation.
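As a rough illustration of what applying a steering vector means, the sketch below adds a scaled vector to a single hidden-state vector. It is not the paper's implementation: the function name `apply_steering`, the scale `alpha`, and all numbers are hypothetical, and the optimization step that JUSSA uses to obtain the vector from a single training example is not shown.

```python
from typing import List


def apply_steering(hidden: List[float], steer: List[float], alpha: float = 1.0) -> List[float]:
    """Return hidden + alpha * steer, elementwise (hypothetical helper)."""
    if len(hidden) != len(steer):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + alpha * s for h, s in zip(hidden, steer)]


# Toy values for illustration only; these are not real model activations.
hidden_state = [0.5, -1.0, 2.0]
honesty_vector = [0.1, 0.2, -0.3]

# A larger alpha pushes the activation further along the steering direction.
steered = apply_steering(hidden_state, honesty_vector, alpha=2.0)
```

In a real model the same addition would presumably be applied to activations at a chosen layer during generation, so that the steered output can serve as the contrastive alternative shown to the judge.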
Findings
AUROC improved from 0.893 to 0.946 on GPT-4.1 judges
AUROC improved from 0.859 to 0.929 on Claude Haiku judges
Performance drops when task complexity exceeds judge capability
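For readers unfamiliar with the metric, AUROC can be read as the probability that a judge scores a randomly chosen dishonest response higher than a randomly chosen honest one. The sketch below computes it by pairwise comparison. The function name `auroc` and the toy scores and labels are assumptions for illustration and are not data from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. Labels are 1 (dishonest) or 0 (honest).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical judge dishonesty scores and ground-truth labels.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 8 of 9 pairs are ranked correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported improvements should be read.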
Abstract
LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements across both GPT-4.1 (0.893 → 0.946) and Claude Haiku (0.859 → 0.929) judges, though performance degrades when task complexity is mismatched to judge capability, suggesting contrastive evaluation helps most when the task is challenging but…
