On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

Valentin Petrov

arXiv:2603.22061·cs.LG·March 24, 2026

Open Access

TL;DR

In directional abliteration of instruction-tuned models, topic-matched contrast baselines fail to produce functional refusal directions. Geometric analysis shows that matching cancels shared activation components.

Contribution

The paper demonstrates that topic-matched contrast baselines are ineffective for directional abliteration. This challenges the treatment of contrast baseline construction as a mere implementation detail and shows that baseline selection is a methodological concern.

Findings

01

Topic-matched contrast yields no functional refusal directions.

02

Unmatched contrast produces functional refusal directions.

03

Geometric analysis shows cancellation of shared activation components.

Abstract

The removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space. The construction of the contrast baseline, against which harmful prompt activations are compared, has been treated in the existing literature as an implementation detail rather than a methodological concern. The present work investigates whether a topically matched contrast baseline yields superior refusal directions. The investigation is carried out on the Qwen 3.5 2B model using per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. Topic-matched contrast was found to produce no functional refusal directions at any tested weight level on any tested layer, while unmatched contrast on the…
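
The cancellation effect named in the findings can be illustrated with a toy sketch. Note that it is not the paper's pipeline: it swaps in a plain difference-of-means direction for the per-class Self-Organizing Map extraction and a rank-one projection for the SVD orthogonalization, and it runs on synthetic activations. The names `shared`, `specific`, `toy_activations`, `ablate`, and `shared_kept` are hypothetical, and the premise that the behaviorally relevant signal sits in the component common to harmful and topic-matched prompts is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500  # hypothetical residual-stream width and prompts per set

# Two orthonormal toy directions (assumptions, not taken from the paper):
# `shared` is present in harmful AND topic-matched prompts,
# `specific` is present only in harmful prompts.
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
shared, specific = q[:, 0], q[:, 1]

def toy_activations(shared_strength, specific_strength):
    """Synthetic residual-stream activations for one prompt set."""
    noise = 0.05 * rng.normal(size=(n, d))
    return noise + shared_strength * shared + specific_strength * specific

harmful = toy_activations(1.0, 0.2)
unmatched = toy_activations(0.0, 0.0)  # unrelated harmless prompts
matched = toy_activations(1.0, 0.0)    # harmless prompts on the same topics

def mean_difference(a, b):
    """Difference-of-means candidate direction between two prompt sets."""
    return a.mean(axis=0) - b.mean(axis=0)

dir_unmatched = mean_difference(harmful, unmatched)
dir_matched = mean_difference(harmful, matched)

def ablate(weight, direction):
    """Remove `direction` from the output space of `weight`: W - r r^T W."""
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r) @ weight

W = rng.normal(size=(d, d))  # stand-in for a matrix writing to the residual stream

def shared_kept(direction):
    """Fraction of the shared component W can still write after ablation."""
    return np.linalg.norm(shared @ ablate(W, direction)) / np.linalg.norm(shared @ W)

print("shared component in unmatched direction:", dir_unmatched @ shared)
print("shared component in matched direction:  ", dir_matched @ shared)
print("shared signal kept after unmatched ablation:", shared_kept(dir_unmatched))
print("shared signal kept after matched ablation:  ", shared_kept(dir_matched))
```

Under these assumptions the matched direction carries almost none of the shared component, so ablating it leaves that component nearly intact, whereas the unmatched direction removes most of it. Whether real refusal-mediating signal behaves like the toy `shared` component is exactly what the paper's geometric analysis addresses; the sketch only shows the arithmetic of the cancellation.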

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Logic, programming, and type systems