Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Xiaoyu Luo; Wenrui Yu; Qiongxiu Li; Johannes Bjerva

arXiv:2603.02333·cs.CL·March 4, 2026

TL;DR

This paper presents a systematic theoretical and empirical analysis of memorization in diffusion language models (DLMs). It shows how sampling resolution affects training data extraction and reports that DLMs leak less PII than autoregressive models.

Contribution

The paper introduces a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation, and uses it to compare the privacy risks of diffusion language models with those of autoregressive models.

Findings

01. Higher sampling resolution increases the probability of exact training data extraction.

02. Diffusion language models leak less PII than autoregressive models.

03. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization.

Abstract

Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that…
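
As a rough intuition for the resolution claim above, the toy sketch below computes the probability of exactly reproducing one sequence under two unmasking schedules. It is not the paper's framework: it assumes that "sampling resolution" can be read as the number of denoising steps (the abstract as shown here does not define it), that an idealized denoiser knows the true conditionals, and that tokens unmasked in the same step are sampled independently from their marginals. The two-sequence distribution `JOINT`, the `TARGET` sequence, and both helper functions are hypothetical names invented for illustration.

```python
# Hypothetical toy distribution over two-token sequences (not from the paper).
JOINT = {("A", "B"): 0.5, ("C", "D"): 0.5}
TARGET = ("A", "B")  # the "training sequence" we try to reproduce exactly


def marginal(position, token, revealed):
    """P(x[position] == token | tokens already revealed), under JOINT."""
    # Keep only sequences consistent with the tokens revealed so far.
    consistent = {
        seq: p
        for seq, p in JOINT.items()
        if all(seq[i] == t for i, t in revealed.items())
    }
    total = sum(consistent.values())
    hit = sum(p for seq, p in consistent.items() if seq[position] == token)
    return hit / total


def extraction_probability(schedule):
    """Probability of sampling TARGET exactly.

    `schedule` is a list of steps; each step lists the positions unmasked
    together. Assumption: positions in the same step are sampled
    independently, each conditioned only on earlier steps.
    """
    revealed = {}
    prob = 1.0
    for step in schedule:
        for pos in step:
            prob *= marginal(pos, TARGET[pos], revealed)
        # Tokens become visible to the denoiser only after the step ends.
        for pos in step:
            revealed[pos] = TARGET[pos]
    return prob


one_step = extraction_probability([[0, 1]])    # both tokens in one step
two_step = extraction_probability([[0], [1]])  # one token per step
print(one_step, two_step)
```

In this toy, the single-step schedule loses the correlation between the two tokens, so the two-step schedule reproduces the target with higher probability. This only illustrates the direction of the effect under the stated assumptions; the general statement is the paper's Theorem 4.3.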

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Computational and Text Analysis Methods