TL;DR
This paper introduces an activation steering method for large language models that treats inference as a locally linear dynamical system, enabling feedback control for precise, robust alignment without retraining.
Contribution
It shows empirically that transformer layer dynamics are well approximated by locally linear models, which allows the classical linear quadratic regulator (LQR) to be adapted for online, closed-loop activation steering in LLMs.
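To make the LQR idea concrete, here is a minimal sketch of the textbook finite-horizon LQR with a setpoint, applied to a toy linear time-varying system. It is not the paper's implementation: the matrices `A_list` stand in for layer-wise Jacobians, and the choice `B = I` (control added directly to the activations), the cost weights `Q`, `R`, `Qf`, and the function names `lqr_tracking_gains` and `rollout` are all assumptions made for illustration.

```python
import numpy as np

def lqr_tracking_gains(A_list, B, Q, R, Qf, r):
    """Backward Riccati recursion for x_{k+1} = A_k x_k + B u_k.

    Minimizes sum_k (x_k - r)'Q(x_k - r) + u_k'R u_k + (x_N - r)'Qf(x_N - r).
    Returns per-layer pairs (K_k, kff_k) with u_k = -K_k x_k + kff_k.
    """
    P = Qf.copy()          # quadratic term of the cost-to-go
    q = Qf @ r             # linear term of the cost-to-go (carries the setpoint)
    gains = [None] * len(A_list)
    for k in reversed(range(len(A_list))):
        A = A_list[k]
        S = R + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)    # feedback gain
        kff = np.linalg.solve(S, B.T @ q)      # feedforward term toward r
        gains[k] = (K, kff)
        A_cl = A - B @ K                       # closed-loop dynamics at layer k
        q = Q @ r + A_cl.T @ q
        P = Q + A.T @ P @ A_cl
        P = 0.5 * (P + P.T)                    # keep P symmetric numerically
    return gains

def rollout(A_list, B, x0, gains=None):
    """Propagate the state through all layers; gains=None means no steering."""
    x = x0.copy()
    for k, A in enumerate(A_list):
        if gains is None:
            u = np.zeros(B.shape[1])
        else:
            K, kff = gains[k]
            u = -K @ x + kff                   # feedback on the current state
        x = A @ x + B @ u
    return x

# Toy example: 8 hypothetical "layers" acting on a 4-dimensional state.
rng = np.random.default_rng(0)
d, n_layers = 4, 8
A_list = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n_layers)]
B = np.eye(d)
x0, r = rng.standard_normal(d), rng.standard_normal(d)

gains = lqr_tracking_gains(A_list, B, np.eye(d), 0.1 * np.eye(d), 10.0 * np.eye(d), r)
err_open = np.linalg.norm(rollout(A_list, B, x0) - r)
err_closed = np.linalg.norm(rollout(A_list, B, x0, gains) - r)
print("open-loop error:", err_open, "closed-loop error:", err_closed)
```

Because each `u_k` is computed from the current state `x_k`, the controller reacts to wherever the state actually is at that layer, which is what separates closed-loop steering from adding a fixed, precomputed vector.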
Findings
Achieves state-of-the-art modulation of toxicity, truthfulness, and refusal in LLMs.
Provides theoretical bounds on setpoint tracking error.
Outperforms baseline steering methods across models and tasks.
Abstract
Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and…
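The local-linearity claim in the abstract can be illustrated on a toy layer. The sketch below is not the paper's experiment: the residual tanh layer, its random weights, and the names `make_toy_layer` and `linearization_error` are hypothetical stand-ins. It compares the true change in a layer's output against the first-order prediction from the layer's Jacobian; for a real transformer block the Jacobian would typically come from automatic differentiation rather than a closed form.

```python
import numpy as np

def make_toy_layer(d, hidden, rng):
    """Hypothetical residual layer f(x) = x + W2 tanh(W1 x) and its exact Jacobian."""
    W1 = rng.standard_normal((hidden, d)) / np.sqrt(d)
    W2 = rng.standard_normal((d, hidden)) / np.sqrt(hidden)

    def f(x):
        return x + W2 @ np.tanh(W1 @ x)

    def jac(x):
        # d/dx tanh(W1 x) = diag(1 - tanh(W1 x)^2) W1
        return np.eye(d) + W2 @ np.diag(1.0 - np.tanh(W1 @ x) ** 2) @ W1

    return f, jac

def linearization_error(f, jac, x, delta):
    """Relative error of the first-order model f(x + delta) ~ f(x) + J(x) delta."""
    actual = f(x + delta) - f(x)
    predicted = jac(x) @ delta
    return np.linalg.norm(actual - predicted) / np.linalg.norm(actual)

rng = np.random.default_rng(1)
f, jac = make_toy_layer(d=8, hidden=16, rng=rng)
x = rng.standard_normal(8)
direction = rng.standard_normal(8)
direction /= np.linalg.norm(direction)

# The linear model should be accurate for small perturbations and degrade as they grow.
for scale in (1e-2, 1e-1, 1.0):
    print(scale, linearization_error(f, jac, x, scale * direction))
```

A small relative error for small perturbations is what justifies treating each layer as a linear map around the current activation, which is the property a layer-wise, Jacobian-based controller relies on.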
