Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Isaac David; Arthur Gervais

arXiv:2605.17413·cs.CR·May 19, 2026


TL;DR

This paper studies methods for removing safety alignment from language models used on authorized security tasks. It compares prompting, activation-projection, and LoRA fine-tuning techniques, and measures how each affects security performance, out-of-scope unsafe compliance, and general capabilities.

Contribution

The paper introduces a controlled transformation-evaluation protocol for alignment removal, compares prompting, projection-based, and LoRA-based techniques under that protocol, and reports empirical results on security and safety metrics.

Findings

01

Single-vector refusal projection slightly improves the security score but increases unsafe compliance.

02

Rank-4 refusal-subspace projection achieves better security scores while maintaining spillover rates.

03

Task-only LoRA significantly enhances security performance with minimal unsafe compliance.
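
For intuition on how findings 01 and 02 differ, the two projections can be written in a common form. This is a sketch of a standard directional-ablation formulation, assumed here rather than taken from the paper: h is a hidden activation of dimension d, \hat{r} is a unit-norm refusal direction, and R is a d-by-4 matrix whose orthonormal columns span the refusal subspace. All symbols are illustrative.

```latex
\begin{align}
h'_{\text{single}} &= h - \hat{r}\,\hat{r}^{\top} h, & \|\hat{r}\|_2 &= 1,\\
h'_{\text{rank-4}} &= \left(I - R R^{\top}\right) h, & R^{\top} R &= I_4 .
\end{align}
```

Under this formulation the single-vector case is the special case where R has one column equal to \hat{r}, so the rank-4 variant removes a strictly larger subspace from the activation.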

Abstract

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416…
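
The abstract's point that a failed answer may reflect either missing capability or a refusal can be made concrete by tallying the two separately. The following is a minimal sketch, not the paper's scoring code: the `Outcome` record, the category labels, and the exact metric definitions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One graded model response (hypothetical record layout)."""
    category: str    # assumed labels: "security", "general", "spillover"
    refused: bool    # the model declined to answer
    validated: bool  # the answer passed an external check


def rate(outcomes, category, predicate):
    """Fraction of prompts in `category` whose outcome satisfies `predicate`."""
    subset = [o for o in outcomes if o.category == category]
    if not subset:
        return 0.0
    return sum(1 for o in subset if predicate(o)) / len(subset)


def summarize(outcomes):
    """Separate refusals from attempted-but-failed answers (assumed definitions)."""
    return {
        "security_refusal_rate": rate(outcomes, "security", lambda o: o.refused),
        "security_attempt_rate": rate(outcomes, "security", lambda o: not o.refused),
        "security_success_rate": rate(
            outcomes, "security", lambda o: not o.refused and o.validated
        ),
        # Spillover probes are out of scope, so any non-refusal counts as compliance.
        "spillover_compliance_rate": rate(
            outcomes, "spillover", lambda o: not o.refused
        ),
    }
```

With these definitions, a gap between the attempt rate and the success rate points to missing capability, while a high refusal rate points to refusal-policy intervention.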
