On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

Andy Han; Kristina Fujimoto; Avidan Shah; Kiet Nguyen; Kai Xu; Chen Yueh-Han; Ilia Sucholutsky; Rico Angell

arXiv:2605.21834 · cs.LG · May 22, 2026

TL;DR

On-Policy Consistency Training (OPCT) enhances Large Language Model safety across multiple axes with minimal capability loss by training models on their own responses conditioned on contrastive prompts.

Contribution

The paper introduces OPCT, a novel on-policy consistency training method that outperforms traditional supervised fine-tuning in safety and robustness without significant capability degradation.

Findings

01  OPCT nearly halves the sycophancy rate relative to the baseline.

02  OPCT achieves a nearly 99% jailbreak defense success rate on held-out behaviors.

03  OPCT avoids the capability regressions seen with supervised fine-tuning.

Abstract

Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes:…
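The on-policy idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the divergence (a per-token reverse KL), the choice to sample from the wrapped prompt and use the clean prompt as the supervising condition, and the names `toy_policy` and `on_policy_consistency_loss` are all assumptions made for illustration. A small deterministic function stands in for the language model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(prompt, prefix, vocab_size=4):
    """Hypothetical stand-in for an LLM (assumption, not from the paper).

    Returns next-token probabilities that depend deterministically on the
    prompt and on the tokens generated so far.
    """
    rng = random.Random(f"{prompt}|{prefix}")
    return softmax([rng.uniform(-1.0, 1.0) for _ in range(vocab_size)])


def on_policy_consistency_loss(policy, clean_prompt, wrapped_prompt,
                               num_tokens=5, seed=0):
    """Average per-token KL(student || teacher) along a sampled response.

    Assumptions (labelled, not confirmed by the source):
      - student = the model conditioned on the wrapped prompt
      - teacher = the same model conditioned on the clean (contrastive) prompt
      - the response is sampled from the student, which makes it on-policy
    """
    sampler = random.Random(seed)
    prefix = []
    total = 0.0
    for _ in range(num_tokens):
        student = policy(wrapped_prompt, prefix)
        teacher = policy(clean_prompt, prefix)  # would be held fixed in training
        total += sum(s * math.log(s / t) for s, t in zip(student, teacher))
        # On-policy step: the next token comes from the student's own distribution.
        token = sampler.choices(range(len(student)), weights=student)[0]
        prefix.append(token)
    return total / num_tokens


if __name__ == "__main__":
    clean = "What is the capital of Australia?"
    wrapped = "I am sure it is Sydney. What is the capital of Australia?"
    print(on_policy_consistency_loss(toy_policy, clean, wrapped))
```

In an offline SFT variant, by contrast, the target response would be generated once and reused, so the loss would never be evaluated on the tokens the current model actually produces. In this toy, the loss is zero when the two prompts are identical and positive when the wrapper changes the model's next-token distributions.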
