MemDLM: Memory-Enhanced DLM Training

Zehua Pei; Hui-Ling Zhen; Weizhe Lin; Sinno Jialin Pan; Yunhe Wang; Mingxuan Yuan; Bei Yu

arXiv:2603.22241 · cs.CL · April 14, 2026

TL;DR

MemDLM introduces a memory-augmented training method for diffusion language models, enhancing long-context understanding and convergence by embedding a simulated denoising trajectory into training.

Contribution

The paper proposes a bi-level optimization scheme that adds a parametric memory (a set of fast weights) to DLM training, improving training efficiency and long-context performance.

Findings

01. Faster convergence compared to standard DLM training.
02. Stronger long-context representations.
03. Lower training loss with memory augmentation.

Abstract

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked prediction objective that never exposes the model to the progressive denoising dynamics of inference, and forces all contextual information to be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from…
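
The inner/outer loop structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the additive form of the fast weights, the masking ratios, the learning rates, the first-order outer update, and all names (`train_step`, `mask`, `loss_and_grad`, `outer_ratio`) are assumptions chosen only to make the two loops concrete.

```python
import random


def predict(w, delta, x):
    # Effective weights = base weights + fast weights (the "memory").
    # Additive fast weights are an assumption of this sketch.
    return sum((wi + di) * xi for wi, di, xi in zip(w, delta, x))


def loss_and_grad(w, delta, x, y):
    # Squared error and its gradient w.r.t. the effective weights.
    err = predict(w, delta, x) - y
    return err * err, [2.0 * err * xi for xi in x]


def mask(x, ratio, rng):
    # Zero each feature with probability `ratio`; a stand-in for token masking.
    return [0.0 if rng.random() < ratio else xi for xi in x]


def train_step(w, x, y, rng, ratios=(0.75, 0.5, 0.25), outer_ratio=0.5,
               inner_lr=0.05, outer_lr=0.05):
    # Fast weights start from zero for every sample (hypothetical choice).
    delta = [0.0] * len(w)

    # Inner loop: walk a simulated denoising trajectory, from heavy to
    # light masking, updating only the fast weights.
    for ratio in ratios:
        _, grad = loss_and_grad(w, delta, mask(x, ratio, rng), y)
        delta = [d - inner_lr * g for d, g in zip(delta, grad)]

    # Outer loop: update the base weights with the memory held fixed
    # (first-order approximation: no gradient flows through the inner loop).
    outer_loss, grad = loss_and_grad(w, delta, mask(x, outer_ratio, rng), y)
    new_w = [wi - outer_lr * g for wi, g in zip(w, grad)]
    return new_w, delta, outer_loss
```

In this toy setting, calling `train_step` with an empty `ratios` tuple disables the memory, which makes it easy to compare the outer loss with and without the inner loop on the same sample.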

Code & Models

Repositories

JarvisPei/MemDLM (GitHub)
