Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Isaac David, Arthur Gervais

TL;DR
This paper studies methods for removing safety alignment from language models used on authorized security tasks, and measures how prompting, activation-projection, and LoRA fine-tuning techniques affect security performance, safety, and general capabilities.
Contribution
It introduces a controlled protocol for alignment removal, compares multiple techniques, and provides empirical results on security and safety metrics.
Findings
Single-vector refusal projection slightly improves the security score but increases unsafe compliance.
Rank-4 refusal-subspace projection achieves higher security scores while holding spillover rates steady.
Task-only LoRA significantly enhances security performance with minimal unsafe compliance.
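
To make the difference between single-vector and rank-k refusal projection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`orthonormalize`, `project_out`), the toy three-dimensional activation, and the example directions are all hypothetical, and the summary does not say how the refusal directions are estimated. The sketch assumes "reversible" means that the removed component can be kept and added back, leaving the model weights untouched.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn candidate directions into an orthonormal basis."""
    basis = []
    for d in directions:
        w = list(d)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(h, directions):
    """Remove from activation h its component in span(directions).

    Returns (ablated, removed) with h == ablated + removed, so the
    edit can be undone by adding `removed` back (assumed meaning of
    "reversible" in this sketch).
    """
    removed = [0.0] * len(h)
    for q in orthonormalize(directions):
        c = dot(h, q)
        removed = [r + c * qi for r, qi in zip(removed, q)]
    ablated = [hi - ri for hi, ri in zip(h, removed)]
    return ablated, removed

# Hypothetical toy activation and directions (not taken from the paper).
h = [3.0, 4.0, 5.0]

# Single-vector projection: one refusal direction.
single, single_removed = project_out(h, [[0.0, 2.0, 0.0]])

# Rank-2 subspace projection: the same idea with more directions;
# a rank-4 projection would pass four directions instead.
subspace, subspace_removed = project_out(h, [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

# Undo the single-vector edit by adding the removed component back.
restored = [a + r for a, r in zip(single, single_removed)]
```

In this toy case the single-vector call zeroes only the second coordinate, while the subspace call zeroes the first two. A larger subspace removes more of the activation, which is one plausible reason the two settings could trade off security score against side effects differently.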
Abstract
Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416…
