Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs

Darpan Aswal; Siddharth D Jaiswal

arXiv:2505.14226·cs.CL·April 8, 2026

TL;DR

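This paper shows that safety-aligned large language models are vulnerable to phonetic perturbations: respellings that preserve pronunciation fragment safety-critical tokens during tokenization, which lets prompts bypass safety mechanisms even though the model still understands them.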

Contribution

It introduces CMP-RT (code-mixed phonetic perturbations for red-teaming), a diagnostic probe that shows how phonetic perturbations fragment safety-critical tokens, exposing a structural gap in current safety mechanisms.

Findings

01. Phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores.

02. Standard defenses fail to mitigate this vulnerability, which persists across models and modalities.

03. Layer-wise analysis shows structural gaps between pre-training and alignment stages.

Abstract

Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability -- causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a…
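The fragmentation effect described in the abstract can be illustrated with a minimal sketch. This is not the paper's CMP-RT pipeline: the greedy longest-match tokenizer below is a simplified stand-in for real subword tokenizers, and the vocabulary `TOY_VOCAB`, the function `greedy_tokenize`, and the example respelling are all hypothetical, chosen only to show how a word that maps to one token can split into unrelated pieces once its spelling changes while its pronunciation stays roughly the same.

```python
def greedy_tokenize(word, vocab):
    """Segment `word` by greedy longest match against `vocab`.

    Falls back to a single character when no vocabulary entry matches,
    so the loop always makes progress.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical toy vocabulary (an assumption, not taken from any real model).
TOY_VOCAB = {"danger", "dan", "ger", "day", "n", "jer", "er"}

canonical = greedy_tokenize("danger", TOY_VOCAB)   # canonical spelling
perturbed = greedy_tokenize("daynjer", TOY_VOCAB)  # phonetic respelling

print(canonical)
print(perturbed)
```

In this toy setting the canonical spelling stays a single token, while the respelling is split into sub-words that do not include the original token. Any mechanism keyed to that specific token would therefore see nothing to react to, which is the intuition behind the tokenizer-rooted gap the abstract describes.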
