Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan; Alexander Matt Turner; Mark Kurzeja; David K. Elson; Rohin Shah

arXiv:2510.27062·cs.LG·November 3, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces consistency training methods to improve large language models' robustness against prompt manipulations like sycophancy and jailbreaks by enforcing invariance in responses.

Contribution

The paper proposes two consistency training techniques, BCT and ACT, to enhance model robustness and reduce susceptibility to irrelevant prompt cues.

Findings

01

Both methods reduce sycophancy effectively.

02

BCT outperforms ACT in jailbreak mitigation.

03

Consistency training avoids issues of stale data.

Abstract

An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility…
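The two forms of invariance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that consistency is enforced over outputs (BCT) or over internal activations (ACT), so the function names, the squared-L2 distance, the position-by-position alignment, and the toy `toy_generate` and `toy_augment` stand-ins are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def build_bct_pairs(
    clean_prompts: Sequence[str],
    augment: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Output-level consistency data (hypothetical helper).

    Each augmented prompt is paired with the model's own response to the
    clean prompt, which can then serve as a fine-tuning target.
    """
    return [(augment(p), generate(p)) for p in clean_prompts]


def activation_consistency_loss(
    clean_acts: Sequence[Sequence[float]],
    augmented_acts: Sequence[Sequence[float]],
) -> float:
    """Activation-level consistency (hypothetical helper).

    Mean squared L2 distance between activation vectors at aligned
    positions. The clean activations are treated as fixed targets; in a
    real training loop no gradient would flow through them. The choice of
    distance and of which positions to compare is an assumption here.
    """
    if not clean_acts or len(clean_acts) != len(augmented_acts):
        raise ValueError("activation sequences must be non-empty and aligned")
    total = 0.0
    for clean_vec, aug_vec in zip(clean_acts, augmented_acts):
        total += sum((a - c) ** 2 for c, a in zip(clean_vec, aug_vec))
    return total / len(clean_acts)


# Toy stand-ins (assumptions): a real setup would query an LLM instead.
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_augment(prompt: str) -> str:
    # Prepends a sycophancy-style cue stating the user's belief.
    return "I think the answer is B. " + prompt


pairs = build_bct_pairs(["What is 2 + 2?"], toy_augment, toy_generate)
loss = activation_consistency_loss(
    [[0.0, 0.0], [1.0, 1.0]],  # activations on the clean prompt
    [[0.0, 0.0], [1.0, 3.0]],  # activations on the augmented prompt
)
```

In both cases the target comes from the model's own behavior on the clean prompt rather than from a fixed labeled dataset, which is what makes the paradigm self-supervised.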

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The suppression of jailbreaks and sycophancy is an important issue. The authors have effectively suppressed these two behaviors through the two proposed methods, and conducted experiments on both open-source models and frontier models, which demonstrates the feasibility of the methods.

Weaknesses

The authors conducted all evaluations exclusively on MMLU. As a relatively old test set that focuses on scientific fields, it seems insufficient to validate the new model's reliability. While the authors addressed consistency, I am concerned about whether the model's instruction-following ability has declined, because the authors' data construction approach appears likely to cause the model to ignore some risk-free instruction prompts.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- **S1**: Compares activation-aware and blackbox methods using fixed datasets to improve the consistency of models against adversarial/sycophancy-inducing cues.
- **S2**: Details the scientific process that led to the techniques: presents the motivation for using white box methods with an activation patching experiment.
- **S3**: Goes beyond performance differences between the methods and investigates the internal/behavioral differences of both models.
- **S4**: The paper clearly states many lim

Weaknesses

- **W1**: I found the usefulness evaluation quite narrow and not attempting to identify the potential side effects of the technique:
  - For sycophancy, I would have liked to see a useful cue dataset. For example, the user suggests a way of solving a problem to help the model, or adds "I'm an ML researcher, summarize this paper for me" vs "I'm a high school student with no ML background, can you summarize this paper for me?" to ensure the model is able to use those cues to articulate its response

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Given ICLR's focus, the paper is short on theoretical justification for why activation consistency should improve robustness. The mixed results on capability staleness (strong evidence for sycophancy but not jailbreaks) suggest the benefits of fresh data may be context-dependent in ways the authors don't fully explain. ACT's underperformance on jailbreaks relative to BCT, combined with increased helpfulness degradation from both methods, raises questions about practical deployability. The f

Weaknesses

The paper shows a practical defense against known attacks. It cannot be seen as a solution towards the alignment problem in the sense that it cannot provide strong guarantees. True alignment likely requires the model to internalize values during pretraining, not just behavioral conditioning during fine-tuning.

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)