The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen; Jiashu Qu; Zhaorun Chen; Xiaoya Lu; Dongrui Liu; Zhiyuan Liu; Ruixi Wu; Yicun Yang; Xiangqi Jin; Haoyun Xu; Xuyang Liu; Weijia Li; Chaochao Lu; Jing Shao; Conghui He; Linfeng Zhang

arXiv:2507.11097·cs.CL·February 11, 2026

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper uncovers a safety vulnerability in diffusion-based large language models (dLLMs), showing they can be exploited by adversarial prompts that bypass existing safety measures, highlighting the need for improved alignment techniques.

Contribution

The paper introduces DIJA, the first systematic jailbreak attack framework targeting the unique safety weaknesses of dLLMs, demonstrating significant effectiveness over previous methods.

Findings

01

DIJA achieves up to 100% keyword-based ASR on Dream-Instruct.

02

Outperforms existing jailbreak methods by up to 78.5% in evaluator-based ASR.

03

Exposes a new threat surface in dLLM architectures that current safety measures fail to address.

Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 8Confidence 3

Strengths

•Clear architectural insight. The paper crisply articulates why bidirectional infilling plus parallel decoding weakens standard guardrails, and formalizes the forcing effect created by fixing unmasked tokens while infilling masked spans. •Simple, scalable attack. DIJA uses few-shot prompt construction with masking-pattern and separator diversification; Algorithm 1 and Section 3 detail an automated pipeline that doesn’t hide harmful intent yet reliably elicits unsafe content. •Comprehensive e

Weaknesses

•Evaluator dependence & metric triangulation. While the paper uses both keyword-based and evaluator-based metrics (including StrongREJECT), a large portion of the story still relies on LLM judges and prompts (e.g., GPT-4o for HS, DIJA* for construction). Some cross-checking with human raters or multiple independent evaluators would further bolster soundness. •Interface assumptions for mask control. The attack presumes the user can inject mask tokens (e.g., [MASK] or <mask:N>) directly. Many pr

Reviewer 02Rating 6Confidence 5

Strengths

1. The DiJA attack reveals a critical vulnerability in dLLMs and has strong implications for safely open-sourcing dLLMs. The attack appears effective against multiple dLLMs on multiple standard safety benchmarks, is easy to implement and computationally inexpensive. 2. The in-context learning approach to generating the DiJA infilling templates is well-principled and clearly explained. 3. An initial attempt is made at a training-time defense, which shows that robustness to the DiJA vulnerability

Weaknesses

1. (Minor) The first example provided for interleaved mask-text prompting in Figure 1, editing/rewriting, is a bit weak. It suggests an application of fixing small typos by resampling from the model, but typos can be trivially fixed by simple spell checkers after decoding. A stronger-motivated example for using a dLLM could be paraphrasing intermediate sentences/longer phrases. 2. (Major) The proposed DiJA attack needs to be better contextualized within existing similar decoding exploits for aut

Reviewer 03Rating 4Confidence 4

Strengths

- The paper features clear writing and a well-articulated motivation, making the research gap and significance intuitive to follow. - The experimental design is comprehensive and robust, covering multiple representative general-purpose and code-oriented dLLMs, three major jailbreak benchmarks, and direct comparisons with state-of-the-art attack baselines. - The paper proactively explores defensive mechanisms to add depth to safety analysis.

Weaknesses

- The DIJA method appears too simple and just relies on a prompt template for generating interleaved mask-text prompts via in-context learning. Additionally, the paper provides no systematic analysis of the diversity of these mask-text adversarial prompts. - The practical value of the research is limited due to the nascent stage of dLLM development. At present, dLLMs still suffer from noticeable gaps in training stability and the inference ecosystem, leaving few immediate landing scenarios.

Code & Models

Repositories

zichenwen1/dija
jaxOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCorporate Insolvency and Governance