Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Darpan Aswal, Siddharth D Jaiswal

TL;DR
This paper identifies a vulnerability in safety-aligned large language models: perturbations that change a word's spelling but preserve its pronunciation are tokenized differently from the canonical form, and this can be exploited to bypass safety measures.
Contribution
It introduces CMP-RT (code-mixed phonetic perturbations for red-teaming), a diagnostic probe that shows how phonetic perturbations fragment safety-critical tokens into benign sub-words, exposing a structural gap in current safety mechanisms.
Findings
Phonetic perturbations cause safety tokens to fragment into benign sub-words.
Standard defenses fail to mitigate this vulnerability across models and modalities.
Layer-wise analysis shows structural gaps between pre-training and alignment stages.
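The fragmentation effect in the first finding can be illustrated with a toy example. The sketch below is not the paper's method or any real model's tokenizer: the vocabulary `TOY_VOCAB`, the greedy longest-match rule in `greedy_tokenize`, and the token-level check `contains_flagged_token` are all hypothetical stand-ins, chosen only to show how a phonetically similar respelling can split into sub-words that no longer match a flagged token.

```python
# Toy illustration only. The vocabulary, tokenizer rule, and flag check
# below are assumptions for this sketch, not taken from the paper.

# Hypothetical sub-word vocabulary: the canonical word is one token,
# while shorter pieces exist as separate, individually benign entries.
TOY_VOCAB = {"attack", "at", "ak", "a", "t", "k", "c"}

# Hypothetical set of tokens that a token-level safety check watches for.
FLAGGED_TOKENS = {"attack"}


def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` by repeatedly taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def contains_flagged_token(tokens: list[str]) -> bool:
    """Return True if any token is in the hypothetical flagged set."""
    return any(tok in FLAGGED_TOKENS for tok in tokens)


canonical = greedy_tokenize("attack", TOY_VOCAB)  # one intact token
perturbed = greedy_tokenize("atak", TOY_VOCAB)    # phonetic respelling

print(canonical, contains_flagged_token(canonical))
print(perturbed, contains_flagged_token(perturbed))
```

In this toy setting the canonical spelling stays a single flagged token, while the respelling breaks into pieces that the check does not recognize, even though a human reader would interpret both the same way. Real tokenizers and safety mechanisms are far more complex, so this is only an intuition aid for the fragmentation claim.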
Abstract
Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability -- causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a…
