Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

Quy-Anh Dang; Chris Ngo

arXiv:2601.19375·cs.LG·January 28, 2026

TL;DR

Selective Steering combines a norm-preserving rotation with discriminative layer selection to steer large language models at inference time. It reports 5.5x higher attack success rates than prior methods, with zero perplexity violations and near-100% capability retention.

Contribution

The paper presents a mathematically rigorous norm-preserving rotation formulation combined with discriminative layer selection, addressing limitations of previous activation steering methods.

Findings

01. Achieves 5.5x higher attack success rates than prior methods.

02. Maintains zero perplexity violations and near-100% capability retention.

03. Demonstrates effectiveness across nine different models.

Abstract

Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer-specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm-preserving rotation formulation that maintains activation…
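
To make the three intervention styles named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`orthonormal_pair`, `rotate_in_plane`, `add_direction`, `ablate_direction`), the toy vectors, and the choice of a textbook planar rotation are all assumptions made for illustration. The sketch only shows what "norm-preserving" means for a rotation inside a 2D subspace, and why activation addition changes the activation norm while directional ablation simply removes one component.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def orthonormal_pair(d1, d2):
    """Gram-Schmidt: turn two independent directions into an orthonormal basis."""
    n1 = norm(d1)
    b1 = [x / n1 for x in d1]
    overlap = dot(d2, b1)
    residual = [x - overlap * y for x, y in zip(d2, b1)]
    n2 = norm(residual)
    b2 = [x / n2 for x in residual]
    return b1, b2


def rotate_in_plane(h, b1, b2, theta):
    """Rotate the part of h lying in span(b1, b2) by theta (hypothetical helper).

    The component orthogonal to the plane is left untouched, so ||h|| is unchanged.
    """
    c1, c2 = dot(h, b1), dot(h, b2)
    r1 = math.cos(theta) * c1 - math.sin(theta) * c2
    r2 = math.sin(theta) * c1 + math.cos(theta) * c2
    return [x + (r1 - c1) * p + (r2 - c2) * q for x, p, q in zip(h, b1, b2)]


def add_direction(h, b1, alpha):
    """Activation addition: shift h along b1 by a tuned coefficient alpha."""
    return [x + alpha * p for x, p in zip(h, b1)]


def ablate_direction(h, b1):
    """Directional ablation: remove the component of h along b1 (on/off only)."""
    c = dot(h, b1)
    return [x - c * p for x, p in zip(h, b1)]


# Toy activation and toy directions (assumed values, not from the paper).
h = [3.0, 4.0, 12.0]
b1, b2 = orthonormal_pair([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])

rotated = rotate_in_plane(h, b1, b2, math.pi / 3)
shifted = add_direction(h, b1, 2.0)
ablated = ablate_direction(h, b1)

print(norm(h), norm(rotated))  # the two norms agree up to floating-point error
print(norm(shifted))           # addition changes the norm
print(dot(ablated, b1))        # ablation leaves no component along b1
```

In this sketch the rotation angle is the continuous control knob, whereas ablation has no such parameter and addition needs a coefficient whose effect depends on the norm of the activation it is added to.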

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)