MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

Tao Shen; Xin Wan; Taicai Chen; Rui Zhang; Junwen Pan; Dawei Lu; Fanding Lei; Zhilin Lu; Yunfei Yang; Chen Cheng; Qi She; Chang Liu; Zhenbang Sun

arXiv:2511.18262·cs.CV·November 25, 2025

TL;DR

MammothModa2 is a unified autoregressive-diffusion framework that couples semantic understanding with high-fidelity image generation in a single model, without relying on pre-trained generators.

Contribution

The paper presents MammothModa2, an AR-Diffusion model that couples autoregressive semantic planning with diffusion-based image synthesis to support both multimodal understanding and generation.

Findings

01. Achieves state-of-the-art results on public benchmarks.

02. Effectively combines understanding and generation within a single model.

03. Does not rely on pre-trained generators, demonstrating data efficiency.

Abstract

Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained…
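The three ingredients the abstract names for the AR-Diffusion feature alignment module (multi-layer feature aggregation, unified condition encoding, and in-context conditioning) can be illustrated with a toy sketch. This is not the authors' implementation: the softmax-weighted layer average, the single linear projection, the sequence-axis concatenation, and every name and shape below are assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def aggregate_layers(hidden_states, layer_weights):
    """Softmax-weighted average of hidden states from several AR layers.

    Assumption: the paper's multi-layer aggregation may differ; this is
    only one simple way to pool features across layers.
    """
    stacked = np.stack(hidden_states, axis=0)            # (layers, tokens, d_ar)
    w = np.exp(layer_weights - np.max(layer_weights))    # numerically stable softmax
    w = w / w.sum()
    return np.tensordot(w, stacked, axes=(0, 0))         # (tokens, d_ar)


def encode_condition(features, proj):
    """Hypothetical condition encoder: one linear map to the decoder width."""
    return features @ proj                               # (tokens, d_dit)


def in_context_sequence(cond_tokens, latent_tokens):
    """Prepend condition tokens to the noisy latent tokens so a single-stream
    transformer can attend over both in one sequence (assumed reading of
    'in-context conditioning')."""
    return np.concatenate([cond_tokens, latent_tokens], axis=0)


# Toy shapes, all hypothetical.
rng = np.random.default_rng(0)
num_layers, num_cond, d_ar, d_dit, num_latent = 3, 4, 8, 6, 5

hidden_states = [rng.normal(size=(num_cond, d_ar)) for _ in range(num_layers)]
layer_weights = np.zeros(num_layers)                     # uniform weights after softmax
proj = rng.normal(size=(d_ar, d_dit))
noisy_latents = rng.normal(size=(num_latent, d_dit))

cond = encode_condition(aggregate_layers(hidden_states, layer_weights), proj)
sequence = in_context_sequence(cond, noisy_latents)
print(sequence.shape)  # (9, 6): 4 condition tokens followed by 5 latent tokens
```

In this sketch the decoder would receive `sequence` as its input tokens; whether Mammoth2 orders, masks, or embeds the condition tokens this way is not stated in the abstract.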

Code & Models

Models

bytedance-research/MammothModa2-Dev (Hugging Face model · 7 downloads · 3 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning