Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Vishal Pramanik; Maisha Maliha; Susmit Jha; Sumit Kumar Jha

arXiv:2604.10326 · cs.CR · April 14, 2026

TL;DR

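The paper introduces HMNS, a geometry-aware jailbreak method that identifies the attention heads most causally responsible for a model's default behavior, suppresses their write paths, and injects a perturbation constrained to the orthogonal complement of the muted subspace. It reaches higher attack success rates with fewer queries than prior methods.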

Contribution

It presents Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that combines interpretability (causal identification of attention heads) with geometry (a nullspace-constrained perturbation) to subvert a model's default behavior.

Findings

01. HMNS achieves state-of-the-art attack success rates across benchmarks.

02. Fewer queries are needed compared to prior jailbreak methods.

03. Ablation studies confirm the importance of nullspace constraints and iterative re-identification.

Abstract

Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that…
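The geometric constraint in step (iii) can be illustrated with a toy projection. The sketch below is not the authors' implementation: it only shows, in plain Python, what it means for a vector to lie in the orthogonal complement of a subspace. The names `muted`, `delta`, and `project_to_complement` are hypothetical, and the "muted subspace" here is simply the span of two made-up directions in R^3.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        # Subtract the components along the basis vectors found so far.
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(delta, muted_directions):
    """Remove from `delta` every component lying in span(muted_directions)."""
    out = list(delta)
    for q in orthonormal_basis(muted_directions):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Toy data (assumed for illustration only): two directions spanning the
# "muted subspace" and an arbitrary candidate perturbation.
muted = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = [0.5, -2.0, 3.0]

steered = project_to_complement(delta, muted)
# `steered` has zero inner product with every muted direction.
```

After the projection, the perturbation has no component along any muted direction, which is the property the abstract describes as being "constrained to the orthogonal complement of the muted subspace". How the paper constructs that subspace from masked head columns is not specified in the text above.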

Videos

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion · SlidesLive