On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration
Valentin Petrov

TL;DR
This study reveals that using topic-matched contrast baselines in directional abliteration for instruction-tuned models fails to produce effective refusal directions, often canceling shared activation components and hindering harm mitigation.
Contribution
The paper demonstrates that topic-matched contrast baselines are ineffective for directional abliteration, challenging a common assumption and highlighting the importance of contrast baseline selection.
Findings
Topic-matched contrast yields no functional refusal directions.
Unmatched contrast successfully eliminates harmful responses.
Geometric analysis shows cancellation of shared activation components.
Abstract
Inasmuch as the removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space, and inasmuch as the construction of the contrast baseline against which harmful prompt activations are compared has been treated in the existing literature as an implementation detail rather than a methodological concern, the present work investigates whether a topically matched contrast baseline yields superior refusal directions. The investigation is carried out on the Qwen 3.5 2B model using per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. It was found that topic-matched contrast produces no functional refusal directions at any tested weight level on any tested layer, while unmatched contrast on the…
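The role of the contrast baseline can be illustrated with a small numerical sketch. This is not the paper's pipeline: it substitutes a plain difference of means for the per-class Self-Organizing Map extraction, uses synthetic Gaussian activations instead of residual stream activations from a real model, and all names (`refusal_direction`, `orthonormalize`, `ablate`, the `shared` component) are hypothetical. It assumes, purely for illustration, that the component of interest is present in both the harmful prompts and the matched baseline, so that subtracting the matched baseline cancels it.

```python
import numpy as np

def refusal_direction(harmful, contrast):
    """Unit-norm difference of mean activations (harmful minus contrast)."""
    diff = harmful.mean(axis=0) - contrast.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthonormalize(directions):
    """Orthonormal basis for the span of the row vectors, via SVD."""
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    return vt[s > 1e-8 * s.max()]

def ablate(acts, basis, weight=1.0):
    """Subtract weight times the projection of acts onto the basis rows."""
    return acts - weight * (acts @ basis.T) @ basis

rng = np.random.default_rng(0)
n, d = 200, 64

# Assumption: one shared component (axis 0) carries the signal of interest.
shared = np.zeros(d)
shared[0] = 3.0

harmful = shared + 0.1 * rng.standard_normal((n, d))
matched = shared + 0.1 * rng.standard_normal((n, d))  # also carries `shared`
unmatched = 0.1 * rng.standard_normal((n, d))         # does not carry it

u_unmatched = refusal_direction(harmful, unmatched)
u_matched = refusal_direction(harmful, matched)

# Alignment of each extracted direction with the shared axis.
align_unmatched = abs(u_unmatched[0])
align_matched = abs(u_matched[0])

# Orthonormalize the candidate directions and ablate them from the activations.
basis = orthonormalize(np.stack([u_unmatched, u_matched]))
ablated = ablate(harmful, basis, weight=1.0)
```

In this toy setting the unmatched direction points almost exactly along the shared axis, whereas the matched direction is dominated by sampling noise, because the shared component has been subtracted out before normalization. Whether the same mechanism accounts for the results reported in the paper is a question for its geometric analysis, not for this sketch.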
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Logic, programming, and type systems
