Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Mengru Wang; Ziwen Xu; Shengyu Mao; Shumin Deng; Zhaopeng Tu; Huajun Chen; Ningyu Zhang

arXiv:2505.20322·cs.CL·June 4, 2025

TL;DR

This paper introduces Steering Target Atoms (STA), a method that controls language model behavior by isolating and manipulating disentangled knowledge components. It improves safety and remains robust and flexible in adversarial and reasoning tasks.

Contribution

The paper proposes STA, a new technique to isolate and manipulate knowledge components in LLMs, enhancing control precision and robustness beyond existing prompt engineering methods.

Findings

01. STA improves safety and control precision in LLMs.

02. Steering is robust in adversarial scenarios.

03. Steering is effective in large reasoning models, enabling precise control.

Abstract

Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis…
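To make the idea of steering through sparse autoencoder latents concrete, here is a minimal sketch in plain Python. It shows generic SAE-latent steering only: a hidden state is encoded into sparse latents, and the decoder direction of a chosen latent is added back to the hidden state. It is not the paper's STA procedure, in particular it does not show how target atoms are located. The function names, the toy weights, the chosen latent index, and the `strength` parameter are all assumptions made for illustration.

```python
# Illustrative sketch only: generic SAE-latent steering, NOT the paper's STA algorithm.
# All names, shapes, and values below are hypothetical.

def relu(x):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in x]

def sae_encode(h, w_enc, b_enc):
    """Map a hidden state h (length d) to sparse latent activations (length m)."""
    pre = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(w_enc, b_enc)]
    return relu(pre)

def steer(h, w_dec, target_latents, strength):
    """Add `strength` times the decoder direction of each chosen latent to h."""
    steered = list(h)
    for i in target_latents:
        for j, w in enumerate(w_dec[i]):
            steered[j] += strength * w
    return steered

# Toy 2-dimensional hidden state with a 3-latent dictionary (assumed values).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one decoder direction per latent

h = [0.5, -2.0]
latents = sae_encode(h, w_enc, b_enc)                        # latent 1 is inactive here
steered = steer(h, w_dec, target_latents=[1], strength=3.0)  # push along latent 1
latents_after = sae_encode(steered, w_enc, b_enc)            # latent 1 is now active
```

In this toy setup only the targeted latent changes its activation, because its decoder direction is orthogonal to the others. In a real model the directions overlap, which is the entanglement problem the abstract describes.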

Code & Models

Repositories

zjunlp/steer-target-atoms (jax, Official)

Videos

Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms · underline

Taxonomy

Topics: Advanced Materials Characterization Techniques · Radiation Effects in Electronics