Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Yuanhe Zhang; Fangzhou Xie; Zhenhong Zhou; Zherui Li; Hao Chen; Kun Wang; Yufei Guo

arXiv:2507.19227 · cs.CL · July 28, 2025

TL;DR

This paper uncovers significant safety vulnerabilities in Large Language Diffusion Models (LLDMs) by introducing a novel jailbreak method that achieves high attack success rates, and it shows that LLDMs carry a greater risk of harmful content generation than traditional LLMs.

Contribution

The paper presents the PArallel Decoding jailbreak (PAD) method and demonstrates its effectiveness in exposing safety flaws in LLDMs, highlighting architectural vulnerabilities and safety concerns.

Findings

01. PAD achieves a 97% success rate in jailbreak attacks.

02. LLDMs generate harmful content twice as fast as comparable LLMs.

03. Significant safety vulnerabilities are revealed in diffusion-based language models.

Abstract

Large Language Diffusion Models (LLDMs) exhibit comparable performance to Large Language Models (LLMs) while offering distinct advantages in inference speed and mathematical reasoning tasks. The precise and rapid generation capabilities of LLDMs amplify concerns about harmful generations, while existing jailbreak methodologies designed for LLMs show limited effectiveness against LLDMs and fail to expose safety vulnerabilities. Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or whether existing attacks are incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD…

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Computational and Text Analysis Methods