Minimizing Collateral Damage in Activation Steering
Tam Nguyen, Tu Anh Nguyen, Sina Alemohammad, Richard G. Baraniuk

TL;DR
This paper introduces a mathematically grounded framework for activation steering in LLMs that minimizes collateral damage by accounting for the empirical second moment of activations, leading to more precise control.
Contribution
It formalizes collateral damage, models steering as a constrained optimization problem, and proposes a method that reduces unintended changes along non-target feature directions by weighting the perturbation with the empirical second-moment matrix of activations.
Findings
Reduces unintended changes in non-target features during activation steering.
Balances target feature alignment with preservation of unrelated task performance.
Provides a principled, second-moment-aware approach to minimize collateral damage.
Abstract
Activation steering is a method for controlling Large Language Model (LLM) behavior by intervening in its internal representations to increase the alignment with a specific target feature direction. However, standard interventions, such as vector addition, often cause "collateral damage", defined as unintended changes in the alignment of activations along other non-target feature directions. This damage occurs because standard methods implicitly assume the isotropy of non-target features. In this work, we provide a mathematical formalization of collateral damage and introduce a principled framework that models steering as a constrained optimization problem. Our method finds a new activation that minimizes the expected squared collateral change weighted by the empirical second-moment matrix of activations. This weighting encodes the nonuniform cost of the perturbation in different…
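The abstract is cut off before it states the exact constraint or solution, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes (hypothetically) that the constraint is a linear equality on the target alignment, v·(h + δ) = target, and that the collateral cost of a perturbation δ is δᵀΣδ, where Σ is the empirical second-moment matrix of activations. Under those assumptions the minimizer has the closed form δ = (target − v·h) Σ⁻¹v / (vᵀΣ⁻¹v), which reduces to plain vector addition when Σ is proportional to the identity (the isotropic case). All function names, the ridge term, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def second_moment(acts, ridge=1e-6):
    """Empirical second-moment matrix (1/n) * sum_i h_i h_i^T.
    The small ridge is an assumption added here to keep it invertible."""
    n, d = acts.shape
    return acts.T @ acts / n + ridge * np.eye(d)

def steer_additive(h, v, target):
    """Baseline vector addition: move along unit vector v until v.h == target."""
    return h + (target - v @ h) * v

def steer_weighted(h, v, target, sigma):
    """Assumed formulation: minimize delta^T sigma delta
    subject to v.(h + delta) == target (closed form via a Lagrange multiplier)."""
    w = np.linalg.solve(sigma, v)          # sigma^{-1} v
    return h + (target - v @ h) * w / (v @ w)

def collateral_cost(delta, sigma):
    """Second-moment-weighted squared size of the perturbation."""
    return float(delta @ sigma @ delta)

# Synthetic, anisotropic "activations" (illustrative data, not from the paper).
rng = np.random.default_rng(0)
d = 8
scales = np.linspace(0.2, 3.0, d)
acts = rng.normal(size=(2000, d)) * scales
sigma = second_moment(acts)

v = rng.normal(size=d)
v /= np.linalg.norm(v)                     # unit-norm target feature direction
h = acts[0]
target = v @ h + 2.0                       # raise target alignment by 2.0

h_additive = steer_additive(h, v, target)
h_weighted = steer_weighted(h, v, target, sigma)
cost_additive = collateral_cost(h_additive - h, sigma)
cost_weighted = collateral_cost(h_weighted - h, sigma)

print("additive cost:", cost_additive)
print("weighted cost:", cost_weighted)
```

Both steered activations reach the same target alignment, but under this cost the weighted perturbation is never more expensive than vector addition: by the Cauchy-Schwarz inequality, (vᵀΣv)(vᵀΣ⁻¹v) ≥ 1 for unit-norm v, with equality when v is an eigenvector of Σ (which includes the isotropic case).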
