Cross-Lingual Jailbreak Detection via Semantic Codebooks

Shirin Alanova; Bogdan Minko; Sabrina Sadiekh; Evgeniy Kokuykin

arXiv:2604.25716·cs.CL·April 29, 2026

TL;DR

This paper proposes a language-agnostic, semantic similarity-based method to detect cross-lingual jailbreak prompts for large language models, reducing vulnerabilities without retraining.

Contribution

It introduces a training-free external guardrail that compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, evaluated across four languages, three embedding models, and three target LLMs.
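
To make the codebook idea concrete, here is a minimal sketch of how such a guardrail could score a query. It is not the authors' implementation. The aggregation rule (maximum cosine similarity over codebook entries), the function names, the threshold value, and the toy 3-dimensional vectors are all assumptions for illustration; in practice the vectors would come from a multilingual embedding model.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def codebook_score(query_vec: Vector, codebook: Sequence[Vector]) -> float:
    """Score a query by its closest codebook entry (assumed aggregation: max)."""
    return max(cosine(query_vec, entry) for entry in codebook)


def is_flagged(query_vec: Vector, codebook: Sequence[Vector], threshold: float) -> bool:
    """Flag the query when its codebook score exceeds the threshold."""
    return codebook_score(query_vec, codebook) > threshold


# Toy vectors standing in for embeddings of English jailbreak prompts
# (codebook) and of incoming queries in any language.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
jailbreak_like = [0.9, 0.1, 0.0]   # close to the first codebook entry
benign_like = [0.0, 0.0, 1.0]      # orthogonal to every codebook entry

print(is_flagged(jailbreak_like, codebook, threshold=0.8))
print(is_flagged(benign_like, codebook, threshold=0.8))
```

Because the check only needs an embedding of the incoming query, it can sit in front of a black-box LLM without access to the model's weights.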

Findings

01. High effectiveness on canonical jailbreak templates with near-perfect separability (AUC up to 0.99).

02. Significant reduction in attack success rates under strict low-FPR constraints on curated benchmarks.

03. Degradation in detection performance under distribution shifts and diverse unsafe benchmarks.
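
The two metrics named in these findings, AUC and detection at a fixed false-positive rate (FPR), can be illustrated with a short sketch. The paper's evaluation code is not shown here, so the function names, the strict `>` flagging rule, and the toy similarity scores below are assumptions chosen only to show how the quantities relate.

```python
import math
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], max_fpr: float) -> float:
    """Pick a threshold so that at most max_fpr of benign scores exceed it."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign queries we may wrongly flag.
    allowed = min(math.floor(max_fpr * len(ranked)), len(ranked) - 1)
    # Flagging uses a strict '>', so scores tied with the threshold pass.
    return ranked[allowed]


def rate_above(scores: Sequence[float], threshold: float) -> float:
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def auc(attack_scores: Sequence[float], benign_scores: Sequence[float]) -> float:
    """Probability that a random attack outscores a random benign query."""
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half
    return wins / (len(attack_scores) * len(benign_scores))


# Toy codebook-similarity scores (hypothetical, not from the paper).
benign = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.60]
attack = [0.45, 0.70, 0.90, 0.35]

thr = threshold_at_fpr(benign, max_fpr=0.25)
print("threshold:", thr)
print("FPR:", rate_above(benign, thr))
print("detection rate:", rate_above(attack, thr))
print("AUC:", auc(attack, benign))
```

A stricter FPR budget pushes the threshold up, so fewer attacks are caught even when AUC is high. This is why separability and low-FPR detection are reported as separate findings.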

Abstract

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer.…
