Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

Yuanfeng Xu; Yuhao Chen; Liang Lin; Guangrun Wang

arXiv:2601.04056·cs.CL·January 8, 2026

TL;DR

This paper introduces CoM-DAD, a unified multimodal generative framework that combines discrete and continuous diffusion processes to improve stability and alignment in text-image generation.

Contribution

The paper proposes a novel hierarchical dual-process framework that decouples semantic planning from token synthesis, enabling stable, scalable multimodal generation.

Findings

01. Outperforms existing models in stability and quality
02. Effectively aligns modalities without heavy contrastive encoders
03. Establishes a new paradigm for unified text-image generation

Abstract

The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning