Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu

TL;DR
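Diffusion large language models (D-LLMs) are inherently more robust against jailbreak attacks originally designed for autoregressive LLMs because of their diffusion-style generation, but a simple context nesting strategy, which embeds harmful requests within structured benign contexts, can bypass this robustness and reveals a critical vulnerability.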
Contribution
This work analyzes the safety benefits of D-LLMs' diffusion process and uncovers a simple context nesting attack that can bypass their safety mechanisms.
Findings
The diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations.
Context nesting, a simple black-box strategy, can bypass D-LLMs' safety blessing and achieves high attack success rates.
The first successful jailbreak of Gemini Diffusion exposes a critical vulnerability.
Abstract
Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving…
