AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

TL;DR
AlphaSteer is a theoretically grounded activation steering method that improves LLM safety by balancing refusal of malicious prompts against preserved performance on benign prompts.
Contribution
It proposes a novel null-space constrained learning approach for activation steering, enhancing safety without sacrificing utility in LLMs.
Findings
Significantly improves safety against jailbreak attacks
Maintains high utility on benign prompts
Outperforms prior methods in robustness and effectiveness
Abstract
As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which will further induce the refusal behaviors of LLMs. However, indiscriminately applying activation steering fundamentally suffers from the trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and…
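The trade-off described in the abstract can be made concrete with a toy example. The following is a minimal sketch, not the authors' code: it assumes a difference-of-means estimate for the refusal direction (a common choice, not confirmed by the text above) and uses random NumPy arrays in place of real LLM activations. The names `refusal_direction` and `steer` are hypothetical.

```python
import numpy as np

def refusal_direction(h_malicious, h_benign):
    """Difference-of-means direction (an assumed estimator, for illustration)."""
    return h_malicious.mean(axis=0) - h_benign.mean(axis=0)

def steer(h, r, strength=1.0):
    """Indiscriminate steering: add the same vector r to every activation row."""
    return h + strength * r

# Toy stand-ins for hidden states; real activations would come from an LLM layer.
rng = np.random.default_rng(0)
d = 16
h_benign = rng.normal(size=(32, d))
h_malicious = rng.normal(size=(32, d)) + 2.0

r = refusal_direction(h_malicious, h_benign)

# Per-prompt displacement caused by steering.
shift_benign = np.linalg.norm(steer(h_benign, r) - h_benign, axis=1)
shift_malicious = np.linalg.norm(steer(h_malicious, r) - h_malicious, axis=1)
```

Because the same vector is added everywhere, every benign activation is displaced by exactly as much as every malicious one (`shift_benign` equals `shift_malicious`), which is one way to see why over-refusal on benign prompts can occur.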
Peer Reviews
Decision: ICLR 2026 Poster
- Strong theoretical foundation with principled learning objectives based on null-space constraints and linear regression, providing clear mathematical grounding for the approach.
- Addresses a critical limitation of existing activation steering methods with an elegant solution that treats benign and malicious prompts differently.
- Comprehensive experimental evaluation across multiple jailbreak attacks (GCG, AutoDAN, PAIR, etc.) and utility benchmarks demonstrating consistent improvements.
- We
- The paper would benefit from more theoretical analysis of when and why the null-space constraint successfully preserves utility, and under what conditions it might fail.
- I think the paper would benefit from more details on how AlphaSteer is learned for the experiments, to give a better sense of cost and scalability.
1. The method grounds activation steering in a clear linear-algebraic framework: (1) preserve utility by projecting benign activations into a learned (near) null-space, and (2) enhance safety via an adaptive, data-driven refusal vector estimated in closed form.
2. Across diverse jailbreak families, the approach delivers state-of-the-art (SOTA) defense success on malicious prompts while maintaining (or minimally impacting) compliance and standard-task performance on benign prompts, consistently ou
1. The method introduces the computation of a null-space projection matrix but does not show whether this computation is costly. To demonstrate practical usability, it would help to compare its computational cost with existing baselines. For example, Surgical [1] reports inference-time and memory comparisons.
2. The evaluation relies solely on GPT-4o as the LLM-as-a-judge for DSR (Defense Success Rate) and CR (Compliance Rate), while having no justification for the model selecti
- The proposed defense achieves a better utility score (even slightly better than standard models on average).
- The paper shows theoretical grounding on its optimization of the learnable refusal vector.
- The proposed method achieves a better defense success rate on average against recent jailbreak attacks.
- The paper is well written and easy to read.
- The contribution may be limited, since other learnable activation-steering methods were already available before the ICLR submission deadline. General learnable activation-steering methods include: [1] https://arxiv.org/abs/2505.20309v2 (version 1 released in May 2025) [2] https://arxiv.org/abs/2506.03292 (hypernetwork-based steering) [3] https://aclanthology.org/2024.findings-emnlp.479.pdf The reviewer does not consider papers released after September 2025.
- The experiments are not rigorous. Better a
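The two-step recipe summarized in the reviews above (a near null-space projection for benign activations, plus a refusal vector fitted in closed form) can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the SVD threshold `rel_tol`, the ridge weight `alpha`, and the difference-of-means target `r` are all hypothetical, and the toy benign activations are constructed to be exactly low-rank, which real LLM activations generally are not.

```python
import numpy as np

def null_space_projector(h_benign, rel_tol=1e-8):
    """Return P (d x d) such that h_benign @ P is approximately zero."""
    _, s, vt = np.linalg.svd(h_benign, full_matrices=True)
    rank = int(np.sum(s > rel_tol * s[0]))
    null_basis = vt[rank:].T  # columns span the (near) null space
    return null_basis @ null_basis.T

def fit_steering_map(h_malicious, P, r, alpha=1e-2):
    """Closed-form ridge regression so that (h_malicious @ P) @ W is close to r."""
    X = h_malicious @ P
    R = np.tile(r, (X.shape[0], 1))
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ R)

def steer(h, P, W, strength=1.0):
    """Input-dependent steering: the added vector is (h @ P) @ W."""
    return h + strength * (h @ P) @ W

# Toy data (assumption): benign activations live in a 6-dim subspace of R^16.
rng = np.random.default_rng(0)
d, k = 16, 6
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]  # orthonormal columns
h_benign = rng.normal(size=(64, k)) @ basis[:, :k].T
h_malicious = 0.3 * rng.normal(size=(64, d)) + 3.0

r = h_malicious.mean(axis=0) - h_benign.mean(axis=0)  # assumed refusal target
P = null_space_projector(h_benign)
W = fit_steering_map(h_malicious, P, r)

benign_shift = np.linalg.norm(steer(h_benign, P, W) - h_benign, axis=1)
malicious_shift = np.linalg.norm(steer(h_malicious, P, W) - h_malicious, axis=1)
```

In this toy setting the benign activations are left numerically unchanged because `h_benign @ P` vanishes, while malicious activations are still moved toward `r`. With real activations the null space is only approximate, so the benign shift would be small rather than zero, and the threshold would need tuning.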
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection · Advanced Malware Detection Techniques
