TL;DR
The paper introduces Head-Masked Nullspace Steering (HMNS), a geometry-aware method that identifies the attention heads most causally responsible for a language model's default behavior and suppresses them, raising jailbreak attack success rates while using fewer queries.
Contribution
It presents Head-Masked Nullspace Steering (HMNS), a new circuit-level intervention leveraging interpretability and geometry to enhance model subversion techniques.
Findings
HMNS achieves state-of-the-art attack success rates across multiple jailbreak benchmarks, safety defenses, and widely used language models.
Fewer queries are needed compared to prior jailbreak methods.
Ablation studies confirm the importance of nullspace constraints and iterative re-identification.
Abstract
Large language models remain vulnerable to jailbreak attacks (inputs designed to bypass safety mechanisms and elicit harmful responses) despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that…
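The phrase "constrained to the orthogonal complement of the muted subspace" refers to a standard linear-algebra operation: removing from a vector every component that lies in a given subspace. The sketch below illustrates only that generic projection, using Gram-Schmidt in plain Python. It is not the paper's implementation; the names `muted`, `delta`, `project_out`, and `orthonormal_basis` are hypothetical and chosen for illustration.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(v, vectors):
    """Return the component of v orthogonal to every vector in `vectors`."""
    w = list(v)
    for q in orthonormal_basis(vectors):
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w

# Hypothetical toy data: two directions spanning a subspace, and one vector.
muted = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
delta = [3.0, -2.0, 5.0, 1.0]
steered = project_out(delta, muted)
```

After the projection, `steered` has zero inner product with each vector in `muted`, which is the defining property of lying in the orthogonal complement.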