Endogenous Resistance to Activation Steering in Language Models

Alex McKenzie; Keenan Pepper; Stijn Servaes; Martin Leitgab; Murat Cubuktepe; Mike Vaiana; Diogo de Lucena; Judd Rosenblatt; Michael S. A. Graziano

arXiv:2602.06941 · cs.LG · February 9, 2026

Open Access

TL;DR

This paper investigates endogenous steering resistance (ESR) in large language models: the tendency of a model to recover from task-misaligned activation steering mid-generation. It shows that ablating specific internal latents weakens this resistance, while prompting and training can strengthen it.

Contribution

The paper identifies sparse autoencoder (SAE) latents in Llama-3.3-70B that are causally linked to resistance against activation steering, and demonstrates that this resistance can be deliberately enhanced through prompting and training.

Findings

01 Llama-3.3-70B shows substantial endogenous steering resistance (ESR); smaller models from the Llama-3 and Gemma-2 families exhibit it less frequently.

02 Zero-ablating 26 identified SAE latents reduces the multi-attempt rate by 25%.

03 ESR can be deliberately enhanced through both prompting and training.

Abstract

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt…
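The two interventions named in the abstract, steering with an SAE latent and zero-ablating a set of latents, can be illustrated with a toy sketch. This is not the authors' implementation: the tiny hand-written encoder and decoder, the function names (`steer`, `encode`, `decode`, `zero_ablate`), the vector size, and the choice to ablate by subtracting the selected latents' decoded contribution are all assumptions made for illustration. This page does not specify the layer, the SAE, or the exact ablation procedure used in the paper.

```python
from typing import List, Sequence, Set

Vector = List[float]


def steer(residual: Sequence[float], direction: Sequence[float], strength: float) -> Vector:
    """Add strength * direction to a residual-stream vector (activation steering)."""
    return [r + strength * d for r, d in zip(residual, direction)]


def encode(residual: Sequence[float], enc_rows: Sequence[Sequence[float]],
           biases: Sequence[float]) -> Vector:
    """Toy SAE encoder: ReLU(W_enc @ x + b), one activation per latent."""
    return [max(0.0, sum(w * x for w, x in zip(row, residual)) + b)
            for row, b in zip(enc_rows, biases)]


def decode(latents: Sequence[float], dec_rows: Sequence[Sequence[float]]) -> Vector:
    """Toy SAE decoder: sum over latents of activation * decoder direction."""
    out = [0.0] * len(dec_rows[0])
    for act, row in zip(latents, dec_rows):
        for j, w in enumerate(row):
            out[j] += act * w
    return out


def zero_ablate(residual: Sequence[float], enc_rows: Sequence[Sequence[float]],
                biases: Sequence[float], dec_rows: Sequence[Sequence[float]],
                ablate_ids: Set[int]) -> Vector:
    """Set the chosen latents to zero by subtracting their decoded contribution.

    Everything the SAE does not explain (and every other latent) is left untouched.
    """
    latents = encode(residual, enc_rows, biases)
    removed = [a if i in ablate_ids else 0.0 for i, a in enumerate(latents)]
    delta = decode(removed, dec_rows)
    return [r - d for r, d in zip(residual, delta)]


# Hypothetical 3-dimensional residual stream with a 2-latent SAE.
ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
BIAS = [0.0, 0.0]

residual = [2.0, 3.0, 0.5]
steered = steer(residual, DEC[0], strength=4.0)        # push along latent 0
ablated = zero_ablate(steered, ENC, BIAS, DEC, {1})    # silence latent 1 only

print(steered)  # [6.0, 3.0, 0.5]
print(ablated)  # [6.0, 0.0, 0.5]
```

In this toy setup the steering push along latent 0 survives the ablation, while latent 1 is removed. That mirrors the experimental logic described in the abstract, where steering stays active and a separate set of latents is ablated to test whether they are causally involved in the recovery.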

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education