SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails
Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren

TL;DR
SpatialJB exploits the spatial weaknesses in LLMs' token representations to bypass guardrails, revealing a critical vulnerability and prompting the development of new defense strategies for safer deployment.
Contribution
The paper introduces SpatialJB, a novel attack method that uses spatially structured perturbations (redistributing tokens across rows, columns, or diagonals) to bypass LLM output guardrails, exposing a significant security vulnerability.
Findings
SpatialJB achieves a nearly 100% attack success rate (ASR) on leading LLMs.
Even with advanced output guardrails in place, SpatialJB maintains an ASR above 75%.
Baseline defenses show limited effectiveness against SpatialJB.
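The figures above are attack success rates. A minimal sketch of how such a rate is commonly computed, assuming ASR is simply the fraction of attack attempts judged successful; the `Trial` record and its `bypassed` field are hypothetical names used for illustration, and the paper's own judging procedure is not described on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One attack attempt (hypothetical record, not from the paper)."""
    prompt_id: str
    bypassed: bool  # True if an external judge marked the attempt as successful


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Return the fraction of trials marked as successful, in [0, 1]."""
    if not trials:
        raise ValueError("at least one trial is required")
    # Booleans sum as 0/1, so this counts the successful attempts.
    return sum(t.bypassed for t in trials) / len(trials)


if __name__ == "__main__":
    demo = [
        Trial("p1", True),
        Trial("p2", False),
        Trial("p3", True),
        Trial("p4", True),
    ]
    print(f"ASR = {attack_success_rate(demo):.0%}")
```

Under this definition, a reported ASR above 75% with guardrails enabled would mean that more than three in four attempts were still judged successful.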
Abstract
While Large Language Models (LLMs) have powerful capabilities, they remain vulnerable to jailbreak attacks, a critical barrier to their safe deployment in real-time web applications. Commercial LLM providers currently deploy output guardrails to filter harmful outputs, yet these defenses are not impenetrable. Because LLMs rely on autoregressive, token-by-token inference, their semantic representations lack robustness to spatially structured perturbations, such as redistributing tokens across different rows, columns, or diagonals. Exploiting this spatial weakness of the Transformer, we propose SpatialJB, which disrupts the model's output generation process so that harmful content bypasses guardrails undetected. Comprehensive experiments on leading LLMs achieve a nearly 100% attack success rate (ASR), demonstrating the high effectiveness of SpatialJB. Even after adding advanced output guardrails, like the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities
