On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
Andy Han, Kristina Fujimoto, Avidan Shah, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, Rico Angell

TL;DR
On-Policy Consistency Training (OPCT) enhances Large Language Model safety across multiple axes with minimal capability loss by training models on their own responses conditioned on contrastive prompts.
Contribution
The paper introduces OPCT, a novel on-policy consistency training method that outperforms traditional supervised fine-tuning in safety and robustness without significant capability degradation.
Findings
OPCT nearly halves the sycophancy rate relative to the baseline.
OPCT achieves a jailbreak defense success rate of nearly 99% on held-out behaviors.
OPCT avoids capability regressions seen in supervised fine-tuning.
Abstract
Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes:…
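To make the idea in the abstract concrete, here is a minimal toy sketch of an on-policy consistency objective. It is not the paper's implementation: the abstract does not specify the divergence, so the per-token KL(student || teacher) used here is an assumption, and the names `consistency_loss`, `student_logits`, and `teacher_logits` are hypothetical. The sketch assumes the "student" logits come from the model scoring a response it sampled itself for a prompt, and the "teacher" logits come from the same model (treated as a constant) scoring the same response tokens conditioned on the contrastive prompt.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over one sampled response.

    Assumed setup (not confirmed by the abstract):
      student_logits[t]: model logits at response position t, conditioned on
                         the prompt the response was sampled from.
      teacher_logits[t]: logits of the same model at position t for the same
                         response tokens, conditioned on the contrastive
                         prompt, and treated as a fixed target.
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("need two non-empty logit sequences of equal length")
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy usage: two response positions over a three-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
teacher = [[1.0, 1.0, 0.1], [0.3, 1.5, 0.2]]
loss = consistency_loss(student, teacher)
```

In this toy setup the loss is zero when the model assigns the same distribution under both prompts and grows as the two conditionals diverge. In an actual training loop the logits would come from forward passes of the model, and only the student side would receive gradients.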
