Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva

TL;DR
This paper gives a theoretical and empirical analysis of memorization in diffusion language models, showing how sampling resolution affects training data extraction and that diffusion language models leak less PII than autoregressive models.
Contribution
The paper introduces a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation, and it compares the privacy risks of diffusion language models with those of autoregressive models.
Findings
Higher sampling resolution increases the probability of exact training data extraction.
Diffusion language models leak less PII than autoregressive models.
Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization.
Abstract
Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that…
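The abstract describes prefix-conditioned decoding as one case of generation under arbitrary masking patterns. The following is a minimal sketch of that idea only, not the authors' implementation: the mask symbol and the helper names (`prefix_mask`, `random_mask`, `apply_mask`, `is_exact_extraction`) are hypothetical, and no model is involved. A mask marks which positions of a training sequence are hidden, and an extraction counts as exact when every hidden position is reproduced.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def prefix_mask(length, prefix_len):
    """Prefix-conditioned pattern: the first prefix_len tokens stay visible,
    everything after them is hidden (True marks a hidden position)."""
    return [i >= prefix_len for i in range(length)]


def random_mask(length, num_hidden, rng):
    """Arbitrary pattern: hide num_hidden positions chosen uniformly at random."""
    hidden = set(rng.sample(range(length), num_hidden))
    return [i in hidden for i in range(length)]


def apply_mask(tokens, mask):
    """Replace hidden positions with the mask symbol to form the model input."""
    return [MASK if hidden else tok for tok, hidden in zip(tokens, mask)]


def is_exact_extraction(generated, target, mask):
    """True when every hidden position of the target was reproduced exactly.
    Visible positions are ignored because they were given to the model."""
    if not (len(generated) == len(target) == len(mask)):
        return False
    return all(g == t for g, t, hidden in zip(generated, target, mask) if hidden)


target = "the quick brown fox jumps".split()
rng = random.Random(0)

# Autoregressive-style query: condition on a two-token prefix.
ar_style = apply_mask(target, prefix_mask(len(target), 2))

# Diffusion-style query: hide three positions anywhere in the sequence.
dlm_style = apply_mask(target, random_mask(len(target), 3, rng))
```

Under this view, the prefix pattern is simply the mask whose hidden positions form a contiguous suffix, which is why a single exact-match check can cover both settings.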
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Computational and Text Analysis Methods
