On the Geometric Limits of Transformer Defenses against Obfuscation Attacks: Latent Embedding Collapse & Performance Robustness Gap
Becky Mashaido, Tapadhir Das

TL;DR
This paper shows that high accuracy in detecting prompt injection attacks does not guarantee representational robustness: obfuscated prompts can still cause embedding collapse and latent-space instability.
Contribution
The paper introduces the concept of latent embedding collapse and demonstrates a performance-robustness gap in prompt injection defenses across multiple BERT models.
Findings
Obfuscated prompts partially collapse onto clean prompt embeddings.
High classification performance does not prevent embedding overlap and instability.
Increasing model capacity does not mitigate latent embedding collapse.
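The overlap and instability named in these findings can be made concrete with two simple geometric quantities. The sketch below is illustrative only: this summary does not give the paper's exact definitions, so both functions are assumptions. `min_margin` (hypothetical name) takes the smallest Euclidean distance between any clean and any obfuscated embedding, and `intra_class_variance` (hypothetical name) takes the mean squared distance of a class's embeddings to their centroid.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def min_margin(clean, obfuscated):
    """Assumed margin: smallest distance between a clean and an obfuscated embedding.

    A small value means at least one obfuscated prompt sits close to a clean one.
    """
    return min(euclidean(c, o) for c in clean for o in obfuscated)


def intra_class_variance(embeddings):
    """Assumed instability measure: mean squared distance to the class centroid."""
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    return sum(euclidean(e, centroid) ** 2 for e in embeddings) / n


# Toy 2-D "embeddings", made up for illustration only.
clean = [[0.0, 0.0], [1.0, 0.0]]
obfuscated = [[0.0, 3.0], [4.0, 0.0]]
margin = min_margin(clean, obfuscated)
spread = intra_class_variance(obfuscated)
```

A detector can separate the two classes with a decision boundary and still show a small margin and a large spread under these measures, which is the kind of gap the findings describe.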
Abstract
Prompt injection attacks pose significant risks to language model safety, yet existing defenses are typically evaluated using classification performance. We show that high detection performance does not imply representational robustness. Specifically, multi-operator obfuscated prompts (combining homoglyphs, zero-width characters, and punctuation or emoji noise) can partially collapse onto the embedding manifold of clean prompts, a phenomenon we term latent embedding collapse. Across multiple BERT-family encoders with varying depth and capacity, detectors achieve near-perfect classification performance, yet the minimal clean-obfuscated margin is δ = 1.02, indicating near-overlap of obfuscated and clean embeddings. Obfuscated embeddings further exhibit elevated intra-class variance (3.33 ± 6.23), indicating severe latent-space instability despite high performance.…
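To illustrate what a multi-operator obfuscation of the kind named in the abstract might look like, here is a minimal sketch that chains the three operator families: homoglyph substitution, zero-width character insertion, and punctuation or emoji noise. The mapping table, noise set, `rate` parameter, and function name `obfuscate` are all assumptions made for illustration, not the authors' attack implementation.

```python
import random

# Hypothetical Latin -> Cyrillic look-alike table (assumption, not from the paper).
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}
ZERO_WIDTH = "\u200b"  # ZERO WIDTH SPACE
NOISE = ["!", ".", "~", "\U0001F600"]  # punctuation and one emoji


def obfuscate(text, rate=0.3, seed=0):
    """Apply three obfuscation operators character by character.

    rate is the per-character probability of applying each operator;
    seed makes the result reproducible.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Operator 1: swap a character for a visually similar one.
        if ch in HOMOGLYPHS and rng.random() < rate:
            ch = HOMOGLYPHS[ch]
        out.append(ch)
        # Operator 2: insert an invisible zero-width character.
        if rng.random() < rate:
            out.append(ZERO_WIDTH)
        # Operator 3: add punctuation or emoji noise after spaces.
        if ch == " " and rng.random() < rate:
            out.append(rng.choice(NOISE))
    return "".join(out)


attack = obfuscate("ignore previous instructions", rate=0.3, seed=42)
```

The obfuscated string renders almost identically to the original for a human reader but tokenizes differently, which is why its embedding can drift relative to the clean prompt.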
