Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
Nilanjana Das, Manas Gaur

TL;DR
This paper investigates the internal layer-wise features of large language models that contribute to adversarial jailbreak vulnerabilities, proposing a method to identify and steer these features for improved safety.
Contribution
The paper introduces a three-stage pipeline for identifying layer-wise feature vulnerabilities in LLMs and shows that mid-to-late layers are more responsible for unsafe outputs than earlier ones.
Findings
Features in layers 16-25 are more vulnerable to steering than those in earlier layers.
Feature subgroups in the mid-to-late layers contribute most to unsafe outputs.
Targeted feature-level interventions could improve adversarial robustness.
Abstract
Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, the features in…
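The third stage, amplifying selected SAE features, can be illustrated with a minimal sketch. The abstract does not say how amplification is implemented, so this assumes one common form of SAE steering: adding a scaled copy of each chosen feature's decoder direction to a residual-stream vector. The names `steer_residual`, `decoder_directions`, and `strength`, along with the toy vectors and feature ids, are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence


def steer_residual(
    residual: Sequence[float],
    decoder_directions: Dict[int, Sequence[float]],
    feature_ids: Sequence[int],
    strength: float,
) -> List[float]:
    """Return a copy of `residual` with selected feature directions amplified.

    Assumption: steering adds `strength * direction` for each chosen
    feature. The paper may use a different intervention.
    """
    steered = list(residual)  # copy so the input is not mutated
    for fid in feature_ids:
        direction = decoder_directions[fid]
        if len(direction) != len(steered):
            raise ValueError(f"feature {fid} has the wrong dimensionality")
        for i, component in enumerate(direction):
            steered[i] += strength * component
    return steered


# Toy 3-dimensional example with made-up feature ids and directions.
residual = [0.5, -1.0, 2.0]
decoder = {7: [1.0, 0.0, 0.0], 12: [0.0, 0.5, 0.5]}
steered = steer_residual(residual, decoder, feature_ids=[7, 12], strength=2.0)
print(steered)
```

In a real model this update would be applied to the hidden state at one layer during the forward pass, and the harmfulness score of the steered output would then be compared with the unsteered baseline.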
