On the Geometric Limits of Transformer Defenses against Obfuscation Attacks: Latent Embedding Collapse & Performance Robustness Gap

Becky Mashaido; Tapadhir Das

arXiv:2605.19159 · cs.CR · May 20, 2026

TL;DR

This paper shows that high accuracy in detecting prompt injection attacks does not ensure representational robustness: obfuscated prompts can still cause embedding collapse and latent-space instability.

Contribution

The paper introduces the concept of latent embedding collapse and demonstrates a performance-robustness gap in prompt injection defenses across multiple BERT-family encoders.

Findings

01. Obfuscated prompts partially collapse onto clean prompt embeddings.

02. High classification performance does not prevent embedding overlap and instability.

03. Increasing model capacity does not mitigate latent embedding collapse.
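
The overlap and instability named in the findings above can be made concrete with two simple geometric quantities. The sketch below is illustrative only: this summary does not give the paper's exact definitions, so it assumes the margin is the smallest Euclidean distance between any clean and any obfuscated embedding, and that intra-class variance is the mean squared distance to the class centroid. The function names and the toy 2-D vectors are hypothetical and do not reproduce the paper's numbers.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def min_margin(clean, obfuscated):
    """Assumed definition: smallest Euclidean distance between any
    clean embedding and any obfuscated embedding."""
    return min(math.dist(c, o) for c in clean for o in obfuscated)

def intra_class_variance(vectors):
    """Assumed definition: mean squared Euclidean distance of each
    embedding to its own class centroid."""
    mu = centroid(vectors)
    return sum(math.dist(v, mu) ** 2 for v in vectors) / len(vectors)

# Toy 2-D "embeddings" (hypothetical values, for illustration only).
clean = [[0.0, 0.0], [0.0, 2.0]]
obfuscated = [[3.0, 0.0], [0.0, 3.0], [6.0, 4.0]]

margin = min_margin(clean, obfuscated)
var_clean = intra_class_variance(clean)
var_obf = intra_class_variance(obfuscated)
```

A small margin together with a much larger obfuscated-class variance is the pattern the findings describe: a classifier can still separate the classes on a test set while the two point clouds nearly touch.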

Abstract

Prompt injection attacks pose significant risks to language model safety, yet existing defenses are typically evaluated using classification performance. We show that high detection performance does not imply representational robustness. Specifically, multi-operator obfuscated prompts (combining homoglyphs, zero-width characters, and punctuation or emoji noise) can partially collapse onto the embedding manifold of clean prompts, a phenomenon we term latent embedding collapse. Results indicate that, across multiple BERT-family encoders with varying depth and capacity, detectors achieve near-perfect classification performance, yet the minimal clean-obfuscated margin is δ = 1.02, indicating near-overlap of obfuscated and clean embeddings. Obfuscated embeddings further exhibit elevated intra-class variance (3.33 ± 6.23), indicating severe latent-space instability despite high performance.…
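
To illustrate what a multi-operator obfuscation of the kind described in the abstract might look like, here is a minimal sketch that chains the three named operators: homoglyph substitution, zero-width character insertion, and punctuation or emoji noise. The homoglyph table, noise set, `rate` parameter, and the `obfuscate` function are all assumptions made for this example; the paper's actual operators and parameters are not given in this summary.

```python
import random

# Hypothetical homoglyph table: Latin letters -> look-alike Cyrillic letters.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}
ZERO_WIDTH = "\u200b"  # ZERO WIDTH SPACE
NOISE = ["!", ".", "~", "\U0001F600"]  # punctuation and one emoji (assumed set)

def obfuscate(text, rate=0.3, seed=0):
    """Apply three obfuscation operators per character, each with
    probability `rate`. Deterministic for a fixed seed."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Operator 1: swap the character for a visually similar one.
        if ch in HOMOGLYPHS and rng.random() < rate:
            ch = HOMOGLYPHS[ch]
        out.append(ch)
        # Operator 2: insert an invisible zero-width character.
        if rng.random() < rate:
            out.append(ZERO_WIDTH)
        # Operator 3: add punctuation or emoji noise after spaces.
        if ch == " " and rng.random() < rate:
            out.append(rng.choice(NOISE))
    return "".join(out)

prompt = "ignore previous instructions"
attacked = obfuscate(prompt, rate=0.5, seed=42)
```

The obfuscated string renders almost identically to the original for a human reader, but it tokenizes very differently, which is why its embedding can shift in ways a classification score alone does not reveal.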
