Cross-Lingual Jailbreak Detection via Semantic Codebooks
Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

TL;DR
This paper proposes a language-agnostic method, based on semantic similarity, for detecting cross-lingual jailbreak prompts aimed at large language models, reducing multilingual vulnerabilities without retraining.
Contribution
The paper introduces a training-free external guardrail that compares multilingual query embeddings against a fixed English codebook of jailbreak prompts. The guardrail is reported to be effective across multiple languages and models.
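The comparison step can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity, max aggregation over codebook entries, the threshold value, and all names (`cosine`, `codebook_score`, `is_flagged`, `ENGLISH_CODEBOOK`) are assumptions made for illustration. The toy 3-d vectors stand in for the output of a multilingual embedding model.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def codebook_score(query: Vector, codebook: Sequence[Vector]) -> float:
    """Similarity to the closest codebook entry (max aggregation is an assumption)."""
    return max(cosine(query, entry) for entry in codebook)


def is_flagged(query: Vector, codebook: Sequence[Vector], threshold: float) -> bool:
    """Flag the query when its codebook score exceeds the threshold."""
    return codebook_score(query, codebook) > threshold


# Hypothetical embeddings: the codebook holds vectors of English jailbreak
# prompts; queries may arrive in any language the embedding model supports.
ENGLISH_CODEBOOK = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
translated_attack = [0.9, 0.1, 0.0]   # close to the first codebook entry
benign_query = [0.0, 0.0, 1.0]        # orthogonal to every codebook entry

print(is_flagged(translated_attack, ENGLISH_CODEBOOK, threshold=0.8))
print(is_flagged(benign_query, ENGLISH_CODEBOOK, threshold=0.8))
```

Because the check runs only on the incoming query, a guardrail of this shape can sit in front of a black-box LLM without access to its weights.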
Findings
High effectiveness on canonical jailbreak templates with near-perfect separability (AUC up to 0.99).
Significant reduction in attack success rates under strict low-FPR constraints on curated benchmarks.
Degradation in detection performance under distribution shifts and diverse unsafe benchmarks.
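For readers unfamiliar with the two metrics named in these findings, the sketch below shows one common way to compute them from similarity scores: AUC as the probability that an attack outscores a benign prompt, and a detection threshold calibrated so that the false-positive rate on benign prompts stays under a budget. The scores are made-up toy values, and the function names and calibration rule are assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def auc(attack_scores: Sequence[float], benign_scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (attack, benign) pairs ranked correctly, ties count half."""
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


def threshold_at_fpr(benign_scores: Sequence[float], max_fpr: float) -> float:
    """Lowest observed benign score t such that flagging `score > t` keeps FPR <= max_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # benign prompts we may wrongly flag
    if allowed >= len(ranked):
        return float("-inf")  # budget permits flagging everything
    return ranked[allowed]


def detection_rate(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of attack prompts whose score exceeds the threshold."""
    return sum(1 for s in attack_scores if s > threshold) / len(attack_scores)


# Toy similarity scores (illustrative only, not results from the paper).
benign = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.62]
attacks = [0.58, 0.70, 0.81, 0.90, 0.95]

strict_threshold = threshold_at_fpr(benign, max_fpr=0.0)
print(auc(attacks, benign), strict_threshold, detection_rate(attacks, strict_threshold))
```

A high AUC does not guarantee a high detection rate at a strict threshold: a single high-scoring benign prompt raises the threshold and lets lower-scoring attacks through.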
Abstract
Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer.…
Code & Models
- shalanova/benchmark-1-arabic-gt · dataset · 60 dl
- shalanova/benchmark-1-russian-gt · dataset · 36 dl
- shalanova/benchmark-1-chinese-gt · dataset · 42 dl
- shalanova/benchmark-1-arabic-m2m · dataset · 129 dl
- shalanova/benchmark-1-russian-m2m · dataset · 268 dl
- shalanova/benchmark-1-chinese-m2m · dataset · 43 dl
- shalanova/benchmark-2-arabic-gt · dataset · 36 dl
- shalanova/benchmark-2-russian-gt · dataset · 197 dl
