TextLDM: Language Modeling with Continuous Latent Diffusion
Jiaxiu Jiang, Jingjing Ren, Wenbo Li, Bo Wang, Haoze Sun, Yijun Yang, Jianhui Liu, Yanbing Zhang, Shenghe Zheng, Yuan Zhang, Haoyang Huang, Nan Duan, Wangmeng Zuo

TL;DR
TextLDM is a diffusion-based language model that pairs a Transformer-based VAE with latent diffusion for text generation, matching GPT-2 performance on OpenWebText2.
Contribution
The paper adapts visual diffusion techniques to language modeling, demonstrating effective transfer of the latent diffusion framework to text generation.
Findings
TextLDM outperforms previous diffusion language models.
It matches GPT-2 performance on OpenWebText2.
Latent feature alignment with a pretrained language model is crucial.
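As a rough illustration of what latent feature alignment can look like, the sketch below assumes a REPA-style objective that maximizes the per-token cosine similarity between projected VAE latents and hidden states from a frozen language model. The function name `alignment_loss`, the linear projection `proj`, and all shapes are hypothetical. This page does not give the exact loss used by TextLDM.

```python
import numpy as np

def alignment_loss(latents, lm_features, proj):
    """Negative mean cosine similarity between projected latents and LM features.

    latents:     (seq_len, latent_dim) VAE encoder outputs (hypothetical shape)
    lm_features: (seq_len, lm_dim) hidden states from a frozen pretrained LM
    proj:        (latent_dim, lm_dim) trainable projection (assumed linear here)
    """
    projected = latents @ proj
    eps = 1e-8  # guards against division by zero for all-zero rows
    p = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    f = lm_features / (np.linalg.norm(lm_features, axis=-1, keepdims=True) + eps)
    # Cosine similarity per token, averaged over the sequence, negated so
    # that minimizing the loss increases alignment.
    return -np.mean(np.sum(p * f, axis=-1))
```

In a setup like this, the term would typically be added to the VAE reconstruction objective with a weighting coefficient, while the language model providing `lm_features` stays frozen.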
Abstract
Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction…
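The flow matching step mentioned in the abstract can be sketched as follows. This is a minimal example that assumes the common linear interpolation path between Gaussian noise and clean latents, with t = 0 as noise and t = 1 as data. The abstract does not state which path or time convention TextLDM uses, and the names `flow_matching_pair` and `flow_matching_loss` are illustrative only.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one flow matching training example from clean latents.

    x1:  (seq_len, latent_dim) clean VAE latents
    rng: numpy.random.Generator
    Returns (x_t, t, v_target).
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    t = rng.uniform()                   # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight path (assumed)
    v_target = x1 - x0                  # constant velocity along that path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))        # stand-in for encoded text latents
x_t, t, v_target = flow_matching_pair(x1, rng)
```

During training, a denoiser such as a DiT would receive `x_t` and `t` and be optimized so that its output approaches `v_target`. At sampling time the learned velocity field would be integrated from noise toward a latent, which the VAE decoder then maps back to tokens.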
