Towards Understanding Steering Strength
Magamed Taimeskhanov, Samuel Vaiter, Damien Garreau

TL;DR
This paper provides the first theoretical analysis of steering strength in large language models, showing that its effect on model behavior can be non-monotonic and validating the resulting predictions across various models.
Contribution
It introduces a novel theoretical framework for understanding how steering strength influences language model outputs, highlighting non-monotonic effects.
Findings
Steering strength affects next-token probability, concept presence, and cross-entropy.
Some of these effects are non-monotonic in the steering strength.
Theoretical laws accurately predict empirical behaviors.
Abstract
A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, one identifies a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many proposals exist for picking this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate…
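To make the steering operation described in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's model or its derived laws: the two-dimensional hidden state, the three-token unembedding matrix, the steering direction, and the helper names (`steer`, `next_token_probs`, `sweep`) are all hypothetical values chosen for illustration. The sketch adds `alpha * direction` to a hidden vector and recomputes the softmax over next-token logits, with `alpha` playing the role of the steering strength.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer(hidden, direction, alpha):
    # Perturb the hidden representation along `direction` with strength `alpha`.
    return [h + alpha * d for h, d in zip(hidden, direction)]

def next_token_probs(hidden, unembedding):
    # One logit per token: dot product of its unembedding row with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
    return softmax(logits)

# Hypothetical toy values (assumptions, not taken from the paper).
hidden = [0.5, -0.2]
direction = [1.0, 0.0]
unembedding = [
    [1.0, 0.0],  # token 0: moderately aligned with the steering direction
    [2.0, 0.0],  # token 1: strongly aligned with the steering direction
    [0.0, 1.0],  # token 2: orthogonal to the steering direction
]

def sweep(alphas):
    # Next-token distribution for each steering strength in `alphas`.
    return [next_token_probs(steer(hidden, direction, a), unembedding) for a in alphas]

if __name__ == "__main__":
    for alpha, probs in zip([-5.0, 0.0, 5.0], sweep([-5.0, 0.0, 5.0])):
        print(alpha, [round(p, 4) for p in probs])
```

In this toy setup the probability of token 0 is higher at `alpha = 0` than at either `alpha = -5` or `alpha = 5`: weak steering favors it over the orthogonal token, while strong steering hands the mass to the more strongly aligned token 1. This is only an illustration of how a non-monotonic dependence on steering strength can arise in a softmax, under the assumptions stated above.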
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)
