Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang

TL;DR
This paper introduces Steering Target Atoms (STA), a method that isolates and manipulates disentangled knowledge components to control language model behavior precisely and robustly, improving safety and flexibility, especially in adversarial and reasoning settings.
Contribution
The paper proposes STA, a technique that isolates and manipulates knowledge components in LLMs to make behavior control more precise and robust than prompt engineering alone.
Findings
STA improves safety and control precision in LLMs.
Steering remains robust in adversarial scenarios.
STA is also effective in large reasoning models, where it enables precise control.
Abstract
Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis…
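To make the idea of steering with sparse autoencoder components more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how target atoms are located, so the selection rule below (largest difference in mean latent activation between two example sets) is an assumption, as are all function names, shapes, and the random toy weights.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder of a toy sparse autoencoder: hidden states -> latents."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def select_target_atoms(pos_h, neg_h, W_enc, b_enc, top_k=2):
    """Assumed selection rule: keep the top_k latents whose mean activation
    differs most between the two example sets."""
    pos_mean = sae_encode(pos_h, W_enc, b_enc).mean(axis=0)
    neg_mean = sae_encode(neg_h, W_enc, b_enc).mean(axis=0)
    diff = pos_mean - neg_mean
    atom_ids = np.argsort(-np.abs(diff))[:top_k]
    return atom_ids, diff[atom_ids]

def steer(h, atom_ids, weights, W_dec, strength=1.0):
    """Shift h along the decoder directions of the selected atoms only."""
    steering_vector = weights @ W_dec[atom_ids]
    return h + strength * steering_vector

# Toy setup with random weights (hypothetical sizes, no real model involved).
rng = np.random.default_rng(0)
d_model, n_latents = 8, 32
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))

pos_h = rng.normal(loc=0.5, size=(16, d_model))   # states showing the target behavior
neg_h = rng.normal(loc=-0.5, size=(16, d_model))  # states showing the opposite behavior

atom_ids, weights = select_target_atoms(pos_h, neg_h, W_enc, b_enc, top_k=2)
h = rng.normal(size=d_model)
steered = steer(h, atom_ids, weights, W_dec, strength=1.0)
print("selected atoms:", atom_ids)
```

Because only the decoder rows of the selected atoms contribute to the shift, the intervention stays confined to a few directions. That is the intuition behind targeting disentangled components, as opposed to adding a dense steering vector.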
Taxonomy
Topics: Advanced Materials Characterization Techniques · Radiation Effects in Electronics
