Steering at the Source: Style Modulation Heads for Robust Persona Control
Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

TL;DR
This paper introduces Style Modulation Heads, a sparse set of attention heads that enable robust and precise control of LLMs' persona and style, reducing coherency issues associated with residual stream steering.
Contribution
It identifies and localizes a small subset of attention heads responsible for persona and style, enabling targeted interventions for safer and more effective model control.
Findings
Intervening on three specific heads improves control robustness.
Targeted head intervention reduces coherency degradation.
Geometric analysis effectively localizes key attention heads.
Abstract
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While steering effectively controls target traits (e.g., persona), the accompanying coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral…
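To make the localization idea concrete, here is a minimal sketch of ranking attention heads by how much their output shifts along a persona direction. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define its contribution score or how it is combined with layer-wise cosine similarity, so the difference-of-means direction, the projection-based score, and every name below (`cosine`, `mean_diff`, `head_scores`, `top_k`, and the toy activations) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_diff(pos, neg):
    """Difference between the mean of `pos` vectors and the mean of `neg` vectors."""
    dim = len(pos[0])
    mean_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    return [a - b for a, b in zip(mean_pos, mean_neg)]

def head_scores(head_pos, head_neg, direction):
    """Assumed score: projection of each head's mean output shift onto the unit direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction] if norm else direction
    scores = {}
    for head in head_pos:
        shift = mean_diff(head_pos[head], head_neg[head])
        scores[head] = sum(s * u for s, u in zip(shift, unit))
    return scores

def top_k(scores, k):
    """Return the k heads with the largest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical): per-head outputs keyed by (layer, head), 2-dim for readability.
persona = {(0, 0): [[1.0, 0.0], [3.0, 0.0]], (0, 1): [[0.0, 1.0]], (1, 0): [[0.5, 0.5]]}
neutral = {(0, 0): [[0.0, 0.0]], (0, 1): [[0.0, 0.0]], (1, 0): [[0.0, 0.5]]}
direction = [1.0, 0.0]  # assumed persona direction in the residual stream

scores = head_scores(persona, neutral, direction)
selected = top_k(scores, 2)
```

In this sketch, steering would then be applied only to the outputs of the `selected` heads rather than to the full residual stream. The `cosine` helper could be used to compare the assumed direction across layers when deciding which layers to search.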
Taxonomy
Topics: Persona Design and Applications · Machine Learning in Healthcare · Topic Modeling
