Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack; Eugene Belilovsky; Elvis Dohmatob

arXiv:2603.04355·cs.LG·March 5, 2026

TL;DR

This paper introduces an optimal transport-based framework that maps the activations of harmful prompts onto the distribution of harmless ones, bypassing refusal in safety-aligned large language models while preserving their capabilities.

Contribution

We propose a novel optimal transport approach that considers the entire distribution of activations rather than a single refusal direction, enabling more effective and more localized (layer-selective) interventions on refusal behavior in language models.

Findings

01. Our method outperforms state-of-the-art baselines in attack success rates.

02. Layer-selective interventions are more effective than full-network modifications.

03. The approach preserves model perplexity and capabilities.

Abstract

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than…
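
The combination of PCA with closed-form Gaussian optimal transport described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the choice to fit PCA on pooled activations, the subspace size `k`, the ridge term `reg`, and the decision to leave the orthogonal complement of the PCA subspace untouched are all assumptions made for illustration. The sketch uses the standard closed-form map between two Gaussians, T(z) = m_t + A (z - m_s) with A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2}, where (m_s, S_s) and (m_t, S_t) are the source (harmful) and target (harmless) means and covariances in the PCA subspace.

```python
import numpy as np

def sqrtm_psd(mat, eps=1e-8):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    mat = (mat + mat.T) / 2.0                    # remove numerical asymmetry
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)              # guard against tiny negatives
    return (vecs * np.sqrt(vals)) @ vecs.T

def fit_pca_gaussian_ot(harmful, harmless, k=8, reg=1e-6):
    """Fit a k-dim PCA basis and the Gaussian OT map harmful -> harmless.

    harmful, harmless: arrays of shape (n_samples, d) holding activations.
    k and reg are illustrative hyperparameters (assumptions).
    """
    # Assumption: PCA is fit on the pooled, centered activations.
    pooled = np.vstack([harmful, harmless])
    center = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - center, full_matrices=False)
    basis = vt[:k].T                             # (d, k), orthonormal columns

    # Project both sets into the subspace and estimate Gaussian moments.
    zs = (harmful - center) @ basis
    zt = (harmless - center) @ basis
    m_s, m_t = zs.mean(axis=0), zt.mean(axis=0)
    cov_s = np.cov(zs, rowvar=False) + reg * np.eye(k)
    cov_t = np.cov(zt, rowvar=False) + reg * np.eye(k)

    # Closed-form OT map between N(m_s, cov_s) and N(m_t, cov_t).
    s_half = sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    A = s_inv_half @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    return center, basis, m_s, m_t, A

def transport(x, center, basis, m_s, m_t, A):
    """Apply the map inside the PCA subspace only."""
    z = (x - center) @ basis
    z_new = m_t + (z - m_s) @ A.T
    # Assumption: components orthogonal to the subspace are left unchanged.
    return x + (z_new - z) @ basis.T
```

In use, `fit_pca_gaussian_ot` would be called once per intervened layer on activations collected from harmful and harmless prompts, and `transport` would then be applied to that layer's activations at inference time. How the activations are collected and which layers are selected are not specified in the text above, so they are left out of the sketch.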

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)