DuoGen: Towards General Purpose Interleaved Multimodal Generation

Min Shi; Xiaohui Zeng; Jiannan Huang; Yin Cui; Francesco Ferroni; Jialuo Li; Shubham Pachori; Zhaoshuo Li; Yogesh Balaji; Haoxiang Wang; Tsung-Yi Lin; Xiao Fu; Yue Zhao; Chieh-Yun Chen; Ming-Yu Liu; Humphrey Shi

arXiv:2602.00508·cs.CV·February 4, 2026

Open Access

TL;DR

DuoGen is a general-purpose interleaved multimodal generation framework. It combines a large-scale instruction-tuning dataset, an architecture built on a pretrained multimodal LLM and a diffusion transformer (DiT), and a systematic evaluation, and it outperforms prior open-source models in text quality and image fidelity.

Contribution

The paper introduces DuoGen, a new framework that integrates multimodal LLMs with diffusion transformers, along with a large high-quality dataset, to advance interleaved multimodal generation capabilities.

Findings

01. Outperforms prior open-source models in text quality and image fidelity.

02. Achieves state-of-the-art results in text-to-image and image editing tasks.

03. Demonstrates effective general-purpose interleaved multimodal generation.

Abstract

Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites with diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT)…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning