Steer2Edit: From Activation Steering to Component-Level Editing

Chung-En Sun; Ge Yan; Zimo Wang; Tsui-Wei Weng

arXiv:2602.09870·cs.CL·March 4, 2026

Open Access

TL;DR

Steer2Edit introduces a training-free, component-level editing framework that transforms inference-time steering signals into interpretable weight modifications, improving safety, truthfulness, and reasoning efficiency in large language models.

Contribution

Steer2Edit provides a novel, theoretically grounded method for converting steering vectors into weight edits, enabling more precise and interpretable control of model behavior without retraining.

Findings

01 Improves safety by up to 17.2%

02 Increases truthfulness by 9.8%

03 Reduces reasoning length by 12.2%

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding…
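
To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' algorithm. It compares a global activation-steering update with a selective, rank-1 edit of a single output weight matrix. Everything here is an assumption made for illustration: the function names, the use of cosine similarity to score components, the top-k selection rule, and the `strength` parameter do not come from the paper. Each column of the hypothetical `W_out` stands in for the output direction of one component (an MLP neuron or an attention head).

```python
import numpy as np

def activation_steering(h, v, alpha):
    """Inference-time steering: add the same direction to every hidden state."""
    return h + alpha * v

def score_components(W_out, v):
    """Cosine similarity between each column of W_out and the steering vector v.

    Assumption: alignment with v is used as the diagnostic signal.
    """
    v_unit = v / np.linalg.norm(v)
    col_norms = np.linalg.norm(W_out, axis=0)
    return (v_unit @ W_out) / col_norms

def rank1_edit(W_out, v, scores, top_k, strength):
    """Shift only the top_k most aligned columns along v.

    The total change is outer(v_unit, coeffs), a rank-1 matrix.
    """
    v_unit = v / np.linalg.norm(v)
    coeffs = np.zeros(W_out.shape[1])
    top = np.argsort(-np.abs(scores))[:top_k]
    coeffs[top] = strength * scores[top]  # hypothetical per-component scaling
    return W_out + np.outer(v_unit, coeffs), top

# Toy data standing in for a real model's weights and steering vector.
rng = np.random.default_rng(0)
d_model, n_neurons = 16, 64
W_out = rng.normal(size=(d_model, n_neurons))
v = rng.normal(size=d_model)

scores = score_components(W_out, v)
W_edit, edited = rank1_edit(W_out, v, scores, top_k=4, strength=0.5)
delta = W_edit - W_out
```

In this sketch, `activation_steering` changes every hidden state by the same amount at generation time, whereas `rank1_edit` changes the weights once and leaves all but the selected columns untouched, which is the sense in which a component-level edit is selective rather than global.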

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications