Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank

TL;DR
This paper demonstrates that the refusal behavior in large language models is consistent across 14 languages, with a universal refusal direction in embedding space enabling cross-lingual safety interventions without additional training.
Contribution
It uncovers the cross-lingual universality of refusal directions in LLMs and explains the underlying mechanism behind cross-lingual jailbreaks, advancing multilingual safety strategies.
Findings
Refusal directions are transferable across languages in LLMs.
A single English-derived refusal vector can bypass refusals in multiple languages.
Refusal vectors exhibit parallelism in embedding space across languages.
Abstract
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned…
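To make the idea of a "single direction in activation space" concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes the direction is estimated as a difference of mean activations between harmful and harmless prompts (a common choice in prior work; the abstract does not state the extraction method) and that bypassing refusal corresponds to projecting that direction out of the activations. The function names, array shapes, and the simulated "English" and "other language" activations are all hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed method)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove each activation's component along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for model activations (hypothetical data, not PolyRefuse).
rng = np.random.default_rng(0)
d = 16
shared_dir = np.zeros(d)
shared_dir[0] = 1.0  # assumed language-independent refusal axis

en_harmless = rng.normal(size=(200, d))
en_harmful = rng.normal(size=(200, d)) + 4.0 * shared_dir
# A second "language" with its own constant offset but the same refusal axis.
xx_harmless = rng.normal(size=(200, d)) + 1.0
xx_harmful = rng.normal(size=(200, d)) + 1.0 + 4.0 * shared_dir

r_en = refusal_direction(en_harmful, en_harmless)
r_xx = refusal_direction(xx_harmful, xx_harmless)

# Apply the English-derived direction to the other language's activations.
xx_ablated = ablate(xx_harmful, r_en)

print("cosine(r_en, r_xx):", cosine(r_en, r_xx))
print("max residual along r_en:", float(np.abs(xx_ablated @ r_en).max()))
```

In this toy setup the two directions come out nearly parallel only because the data was built with a shared axis; whether real models behave this way is the empirical question the paper addresses.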
