Efficient Refusal Ablation in LLM through Optimal Transport
Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

TL;DR
This paper introduces an optimal transport-based framework to transform harmful model activations into harmless ones, improving safety mechanisms in large language models while preserving their capabilities.
Contribution
We propose a novel optimal transport approach that considers the entire distribution of activations, enabling more effective and localized safety interventions in language models.
Findings
Our method outperforms state-of-the-art baselines in attack success rates.
Layer-selective interventions are more effective than full-network modifications.
The approach preserves model perplexity and capabilities.
Abstract
Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than…
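To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two ideas it names: removing a single refusal direction by orthogonal projection, and mapping harmful activations onto harmless ones with the closed-form Gaussian optimal transport map inside a PCA subspace. It is an illustration under stated assumptions, not the authors' implementation. The function names, the choice to fit PCA on the pooled activations, the subspace size `k`, and the regularizer `eps` are all hypothetical. The Gaussian map itself is the standard closed form, T(z) = mu_t + A (z - mu_s) with A = S^{-1/2} (S^{1/2} T S^{1/2})^{1/2} S^{-1/2}, where S and T are the source and target covariances.

```python
import numpy as np


def _sqrtm_psd(m):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T


def ablate_direction(x, r):
    """Baseline: project one direction r out of every row of x."""
    r = r / np.linalg.norm(r)
    return x - np.outer(x @ r, r)


def fit_pca_gaussian_ot(harmful, harmless, k=8, eps=1e-6):
    """Fit a Gaussian OT map in a k-dimensional PCA subspace.

    Assumptions (not from the paper): PCA is fit on the pooled
    activations, and eps * I is added to each covariance for stability.
    """
    pooled = np.vstack([harmful, harmless])
    center = pooled.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(pooled - center, full_matrices=False)
    basis = vt[:k].T  # shape (d, k), orthonormal columns

    zs = (harmful - center) @ basis   # source (harmful) coordinates
    zt = (harmless - center) @ basis  # target (harmless) coordinates
    mu_s, mu_t = zs.mean(axis=0), zt.mean(axis=0)
    cov_s = np.cov(zs, rowvar=False) + eps * np.eye(k)
    cov_t = np.cov(zt, rowvar=False) + eps * np.eye(k)

    # Closed-form OT map between two Gaussians (A is symmetric PSD).
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    a = s_inv_half @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    return {"center": center, "basis": basis,
            "mu_s": mu_s, "mu_t": mu_t, "A": a}


def apply_map(x, m):
    """Transport the PCA coordinates of x; leave the residual untouched."""
    z = (x - m["center"]) @ m["basis"]
    z_new = m["mu_t"] + (z - m["mu_s"]) @ m["A"].T
    return x + (z_new - z) @ m["basis"].T
```

Unlike the projection baseline, which changes activations along one direction only, the map above matches both the mean and the covariance of the harmful activations to the harmless ones within the chosen subspace, and it leaves the component orthogonal to that subspace unchanged. Which layers to apply such a map to is a separate choice that this sketch does not address.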
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
