Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position
Zhixin Xie, Xurui Song, Jun Luo

TL;DR
This paper analyzes the safety of diffusion large language models (dLLMs), revealing a critical asymmetry in token importance and proposing MOSA, a novel safety alignment method that improves security and utility across various tasks.
Contribution
It introduces the first safety analysis of dLLMs, identifies middle tokens as key to safety, and proposes MOSA, a novel alignment method tailored to their unique generation process.
Findings
MOSA improves robustness against eight attack methods on security benchmarks.
Aligning middle tokens enhances safety without sacrificing utility.
dLLMs tend to generate responses sequentially, which limits an attacker's ability to manipulate middle tokens.
Abstract
Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, there is currently a lack of safety study on this novel architecture. In this paper, we present the first analysis of dLLMs' safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical asymmetry between the defender and attacker in terms of security. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs; this seems to suggest that aligning middle tokens can be more beneficial to the defender. The attacker, on the contrary, may have limited power to manipulate middle tokens, as we find dLLMs have a strong tendency towards a…
Taxonomy
Topics: Natural Language Processing Techniques
