SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Giorgio Piras; Raffaele Mura; Fabio Brau; Luca Oneto; Fabio Roli; Battista Biggio

arXiv:2511.08379 · cs.AI · March 25, 2026

TL;DR

This paper uses Self-Organizing Maps (SOMs) to identify multiple refusal directions in a language model's latent space. Ablating these directions suppresses refusal behavior more effectively than previous single-direction approaches.

Contribution

The paper proposes a technique that uses SOMs to extract multiple refusal directions, and proves that SOMs generalize the difference-in-means technique used by prior single-direction work.

Findings

01. Multiple refusal directions improve suppression effectiveness.

02. Ablating multiple directions outperforms single-direction baselines.

03. Method surpasses specialized jailbreak algorithms.
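
To make the idea of ablating several directions concrete, here is a minimal NumPy sketch. It assumes that ablation means projecting hidden states onto the orthogonal complement of the subspace spanned by the directions; this page does not say how the paper performs the ablation, so treat the function name `ablate_directions` and the projection rule as illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def ablate_directions(hidden, directions, tol=1e-10):
    """Remove from each row of `hidden` its component in span(directions).

    Assumption (not from the source): ablation is an orthogonal projection
    that zeroes out the subspace spanned by the given directions.
    hidden:     array of shape (n, d), one hidden state per row
    directions: array of shape (k, d) or (d,), one direction per row
    """
    hidden = np.asarray(hidden, dtype=float)
    directions = np.atleast_2d(np.asarray(directions, dtype=float))
    # Rows of vt with non-negligible singular values form an orthonormal
    # basis of the spanned subspace, even if directions are not orthogonal.
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    basis = vt[s > tol]
    # Subtract the projection onto that subspace.
    return hidden - (hidden @ basis.T) @ basis

# Random stand-ins for hidden states and directions (not real activations).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
directions = rng.normal(size=(3, 8))
ablated = ablate_directions(hidden, directions)
```

With a single direction this reduces to subtracting the component along that one unit vector. With several directions, going through an orthonormal basis avoids the residue that repeated one-at-a-time projections leave behind when the directions are not mutually orthogonal.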

Abstract

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of…
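
The two ingredients named in the abstract can be sketched in a few lines of NumPy. The first function computes the single difference-in-means direction as the difference between the centroids of harmful and harmless representations. The second trains a small SOM on harmful representations and returns its neurons. The grid size, learning-rate schedule, Gaussian neighborhood, and the function names are assumptions chosen for illustration, not settings taken from the paper. The sketch stops at the trained neurons because the abstract is truncated before it describes how they are turned into directions.

```python
import numpy as np

def difference_in_means(harmful, harmless):
    """Single direction: centroid of harmful minus centroid of harmless."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)

def train_som(data, grid=(2, 2), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a minimal online SOM; returns neuron weights, shape (rows*cols, d).

    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_neurons = rows * cols
    # Position of every neuron on the 2-D grid.
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)],
                      dtype=float)
    # Initialise neurons from randomly chosen data points.
    init = rng.choice(len(data), size=n_neurons, replace=len(data) < n_neurons)
    weights = data[init].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                   # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 1e-3)  # shrinking neighborhood
            x = data[idx]
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Gaussian neighborhood measured on the grid, centred on the BMU.
            grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull every neuron towards the sample, weighted by h.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

# Random stand-ins for prompt representations (not real activations).
rng = np.random.default_rng(1)
harmful = rng.normal(loc=1.0, size=(40, 6))
harmless = rng.normal(loc=0.0, size=(40, 6))
single_direction = difference_in_means(harmful, harmless)
neurons = train_som(harmful, grid=(2, 2), epochs=5)
```

In this sketch each neuron ends up as a prototype for one region of the harmful representations, which is what lets a SOM describe the data with several points where difference-in-means uses a single centroid.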
