Rethinking Token Prediction: Tree-Structured Diffusion Language Model
Zihao Wu, Haoming Yang, Juncheng Dong, Vahid Tarokh

TL;DR
This paper introduces a tree-structured diffusion language model that reduces memory usage and maintains performance by exploiting token hierarchy, challenging the necessity of full-vocabulary prediction layers.
Contribution
The paper proposes a tree-structured diffusion approach that significantly reduces memory requirements and parameter count while preserving language modeling performance.
Findings
Reduces peak GPU memory by 50% compared to state-of-the-art models.
Maintains perplexity performance with fewer parameters and memory.
Demonstrates efficiency gains under limited training resources.
Abstract
Discrete diffusion language models have emerged as a competitive alternative to auto-regressive language models, but training them efficiently under limited parameter and memory budgets remains challenging. Modern architectures are predominantly based on a full-vocabulary token prediction layer, which accounts for a substantial fraction of model parameters (e.g., more than 20% in small scale DiT-style designs) and often dominates peak GPU memory usage. This leads to inefficient use of both parameters and memory under constrained training resources. To address this issue, we revisit the necessity of explicit full-vocabulary prediction, and instead exploit the inherent structure among tokens to build a tree-structured diffusion language model. Specifically, we model the diffusion process with intermediate latent states corresponding to a token's ancestor nodes in a pre-constructed…
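To make the parameter argument in the abstract concrete, here is a minimal sketch comparing the weight count of a dense full-vocabulary output projection with that of a hypothetical tree-structured head. The balanced tree, the fixed branching factor, the one-projection-per-level layout, and all function names are assumptions made for illustration. The abstract is truncated before it describes how the tree is built, so this should not be read as the authors' method.

```python
def full_head_params(hidden_dim: int, vocab_size: int) -> int:
    """Weights in a dense hidden_dim x vocab_size output projection (no bias)."""
    return hidden_dim * vocab_size


def tree_head_params(hidden_dim: int, branching: int, depth: int) -> int:
    """Weights if each tree level uses one hidden_dim x branching projection.

    Assumption for illustration only: one shared projection per level.
    """
    return hidden_dim * branching * depth


def ancestor_path(token_id: int, branching: int, depth: int) -> list[int]:
    """Child index chosen at each level (root first) to reach a leaf token.

    Assumes a balanced tree whose leaves are numbered 0 .. branching**depth - 1.
    """
    if not 0 <= token_id < branching ** depth:
        raise ValueError("token_id outside the tree")
    path = []
    for _ in range(depth):
        path.append(token_id % branching)  # index within the current level
        token_id //= branching             # move up to the parent
    return path[::-1]                      # reorder so the root comes first


if __name__ == "__main__":
    hidden_dim, vocab_size = 768, 50257    # illustrative sizes, not from the paper
    branching, depth = 225, 2              # 225**2 = 50625 leaves >= vocab_size
    print(full_head_params(hidden_dim, vocab_size))
    print(tree_head_params(hidden_dim, branching, depth))
    print(ancestor_path(50256, branching, depth))
```

Under these assumed sizes, predicting a short path of ancestor choices needs far fewer output weights than predicting one of roughly 50k tokens directly, which is the kind of saving the abstract attributes to dropping the full-vocabulary layer.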
