Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation
Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, Yufei Guo

TL;DR
This paper uncovers significant safety vulnerabilities in Large Language Diffusion Models (LLDMs) by introducing a novel jailbreak method, revealing high success rates and increased risks of harmful content generation compared to traditional LLMs.
Contribution
The paper presents the PArallel Decoding jailbreak (PAD) method and demonstrates its effectiveness in exposing safety flaws in LLDMs, highlighting architectural vulnerabilities and safety concerns.
Findings
PAD achieves a 97% success rate in jailbreak attacks.
LLDMs generate harmful content twice as fast as comparable LLMs.
Significant safety vulnerabilities are revealed in diffusion-based language models.
Abstract
Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks. The precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) prove limited effectiveness against LLDMs and fail to expose safety vulnerabilities. Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD…
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Computational and Text Analysis Methods
