Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity
Satchel Grant, Victor Gillioz, Jake Ward, Thomas McGrath

TL;DR
This paper compares two defensive training methods for language models, positive preventative steering (PPS) and inoculation prompting (IP), and finds that they operate through different mechanisms and have distinct effects on trait expression and model behavior.
Contribution
It provides the first behavioral and mechanistic comparison of PPS and IP, clarifying how each method defends against trait acquisition in language models.
Findings
PPS shifts activation gradients to attenuate trait expression.
IP's gradient signature is more diffuse and less mechanistically understood.
PPS can reduce pre-existing trait expression, unlike IP.
Abstract
Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to large language models (LLMs) during training, and both defend the LLM against acquiring the trait. The surprising success of these methods comes with the question: how do they work? Are PPS and IP doing the same thing? We provide behavioral and mechanistic comparisons of these two methods using "evilness" as a case-study trait. Our central finding is that PPS and IP achieve their defensive benefits through distinct mechanisms. Behaviorally, we show that neither PPS nor IP operates through a purely associative mechanism; and PPS can both defend against trait acquisition and actively reduce pre-existing expression, whereas IP is ineffective in models that were previously finetuned to express…
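As a rough illustration of what "adding a trait-inducing object during training" could look like for each method, the toy sketch below contrasts the two interventions. It is not the authors' implementation: the function names, the list-based "activation", the steering coefficient `alpha`, and the example inoculation string are all hypothetical, and it assumes that PPS adds a scaled trait direction to a hidden activation while IP prepends a trait-eliciting instruction to the training prompt.

```python
# Toy sketch only. All names, values, and the inoculation text are hypothetical
# and chosen for illustration; they are not taken from the paper.

def add_steering(hidden, trait_vector, alpha=1.0):
    """PPS-style (assumed): add a scaled trait direction to a hidden activation."""
    return [h + alpha * v for h, v in zip(hidden, trait_vector)]


def add_inoculation(prompt, inoculation="You are an evil assistant."):
    """IP-style (assumed): prepend a trait-eliciting instruction to the prompt."""
    return f"{inoculation}\n{prompt}"


def prepare_training_example(prompt, hidden, trait_vector, method):
    """Apply the chosen intervention at training time only.

    Any other value of `method` leaves the example unchanged, which stands in
    for plain finetuning or for inference, where the object is not added.
    """
    if method == "pps":
        return prompt, add_steering(hidden, trait_vector)
    if method == "ip":
        return add_inoculation(prompt), hidden
    return prompt, hidden
```

In this sketch the PPS object lives in activation space and the IP object lives in the input text, which is one way to picture why the two methods might act through different mechanisms even though both add something trait-inducing during training.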
