MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun

TL;DR
MammothModa2 is a unified autoregressive-diffusion framework that combines semantic understanding and high-fidelity image generation in a single model, without relying on pre-trained generators.
Contribution
The paper presents MammothModa2, an integrated AR-Diffusion model that couples autoregressive semantic planning with diffusion-based image synthesis for multimodal understanding and generation.
Findings
Achieves state-of-the-art results on public benchmarks.
Effectively combines understanding and generation within a single model.
Does not rely on pre-trained generators, demonstrating data efficiency.
Abstract
Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained…
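The alignment module described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the softmax-weighted sum over layers, the single linear projection, and the sequence-axis concatenation are common ways to realize "multi-layer feature aggregation" and "in-context conditioning", and every function name and shape below is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_states, layer_logits):
    """Weighted sum of hidden states taken from several AR layers.

    layer_states: list over layers of [num_tokens][dim] features.
    layer_logits: one score per layer (assumed learnable in practice).
    """
    weights = softmax(layer_logits)
    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    out = [[0.0] * dim for _ in range(num_tokens)]
    for w, states in zip(weights, layer_states):
        for t in range(num_tokens):
            for d in range(dim):
                out[t][d] += w * states[t][d]
    return out

def project(tokens, weight):
    """Linear map from AR width to decoder width; weight is [d_in][d_out]."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]

def in_context_condition(cond_tokens, latent_tokens):
    """Prepend condition tokens to the noisy latent tokens so that a
    single-stream decoder attends over both in one sequence."""
    return cond_tokens + latent_tokens

if __name__ == "__main__":
    # Two AR layers, one token, width 2 (toy values).
    states = [[[1.0, 2.0]], [[3.0, 4.0]]]
    cond = project(aggregate_layers(states, [0.0, 0.0]),
                   [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
    latents = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # stand-in for noisy latents
    sequence = in_context_condition(cond, latents)
    print(len(sequence), "tokens of width", len(sequence[0]))
```

In this reading the decoder needs no separate cross-attention path, since the conditioning tokens and the image latents share one sequence. The abstract's "unified condition encoding" step is not modeled here.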
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
