TL;DR
This paper uses Self-Organizing Maps (SOMs) to identify multiple refusal directions in a language model's latent space. Ablating these directions suppresses refusal behavior more effectively than previous single-direction approaches.
Contribution
The paper proposes a new technique that leverages SOMs to extract multiple refusal directions, improving refusal suppression in language models over prior single-direction methods.
Findings
Multiple refusal directions improve suppression effectiveness.
Ablating multiple directions outperforms single-direction baselines.
Method surpasses specialized jailbreak algorithms.
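To make "ablating multiple directions" concrete, here is a minimal sketch on synthetic data. It assumes that ablation means projecting each hidden state onto the orthogonal complement of the span of the directions; the summary does not say how the paper implements it, so the function name `ablate_directions` and the QR-based orthonormalization are illustrative choices, not the authors' code.

```python
import numpy as np

def ablate_directions(hidden, directions):
    """Remove the span of `directions` from each row of `hidden`.

    hidden:     (n, d) array of hidden states (synthetic in this sketch).
    directions: (k, d) array of directions, not necessarily orthogonal.
    """
    # Orthonormalize first so overlapping directions are not removed twice.
    q, _ = np.linalg.qr(directions.T)   # q has shape (d, k), orthonormal columns
    # Subtract the component of every hidden state that lies in span(q).
    return hidden - (hidden @ q) @ q.T

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))       # stand-in for model activations
directions = rng.normal(size=(3, 16))   # stand-in for extracted directions
ablated = ablate_directions(hidden, directions)
```

After the call, every row of `ablated` has zero component (up to floating-point error) along each of the supplied directions. With a single direction this reduces to the usual single-vector projection.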
Abstract
Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of…
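The two ingredients named in the abstract can be sketched on synthetic data as follows. `difference_in_means` follows the abstract's description directly. `train_som` is a generic textbook SOM written in NumPy, with hypothetical hyperparameters (`grid`, `epochs`, `lr0`, `sigma0`) that are not taken from the paper. The final step, subtracting the harmless centroid from each SOM neuron, is an assumption made by analogy with difference-in-means, since the abstract is cut off at exactly that point.

```python
import numpy as np

def difference_in_means(harmful, harmless):
    """Single direction: harmful centroid minus harmless centroid."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)

def train_som(data, grid=(2, 2), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Generic online SOM; returns one weight vector ("neuron") per grid cell."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize neurons from randomly chosen data points.
    weights = data[rng.choice(len(data), size=rows * cols, replace=False)].copy()
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3       # shrinking neighborhood width
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))          # neighborhood kernel
            weights += lr * h[:, None] * (x - weights)            # pull neurons toward x
            step += 1
    return weights

rng = np.random.default_rng(0)
harmful = rng.normal(loc=1.0, size=(40, 8))    # synthetic "harmful" representations
harmless = rng.normal(loc=0.0, size=(40, 8))   # synthetic "harmless" representations

single_direction = difference_in_means(harmful, harmless)
neurons = train_som(harmful)
# ASSUMPTION: one direction per neuron, relative to the harmless centroid.
multi_directions = neurons - harmless.mean(axis=0)
```

Here `single_direction` is one vector of length 8, while `multi_directions` holds one candidate direction per SOM neuron (four for the 2x2 grid). The sketch does not reproduce the paper's proof that SOMs generalize difference-in-means.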
Code & Models
- 🤗 hell0ks/Solar-Open-100B-jailbreak-gguf (model) · 69 dl · ♡ 2
- 🤗 hell0ks/Solar-Open-100B-jailbreak (model) · 7 dl · ♡ 4
- 🤗 kabachuha/gpt-oss-20b-SOMbliterated (model) · 123 dl · ♡ 6
- 🤗 MagicalAlchemist/Apriel-1.6-15b-Thinker-Magic_beta-decensored (model) · 32 dl · ♡ 2
- 🤗 Magic-Decensored/Apriel-1.6-15b-Thinker-Magic_beta-decensored-GGUF (model) · 255 dl · ♡ 2
- 🤗 kabachuha/Qwen3-4B-Instruct-2507-SOMbliterated (model) · 109 dl · ♡ 5
- 🤗 Beinsezii/llmfan46-Qwen3.5-27B-heretic-v2-GGUF-6.11BPW (model) · 791 dl · ♡ 2
- 🤗 InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026 (model) · 1.1k dl · ♡ 4