Dual Diffusion for Unified Image Generation and Understanding
Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang

TL;DR
This paper introduces a large-scale, end-to-end diffusion model that unifies image generation and understanding tasks, outperforming existing diffusion models and rivaling autoregressive models in versatility.
Contribution
It presents the first fully end-to-end multimodal diffusion model supporting comprehensive vision-language tasks with a novel joint training framework.
Findings
Achieved competitive performance on multiple vision-language benchmarks.
Demonstrated flexibility across tasks like image generation, captioning, and VQA.
Showed potential of diffusion models as an alternative to autoregressive approaches.
Abstract
Diffusion models have achieved tremendous success in text-to-image generation, yet still lag behind on visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale, fully end-to-end diffusion model for multimodal understanding and generation that significantly improves on existing diffusion-based multimodal models and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that jointly trains the conditional likelihoods of both images and text under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable…
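The abstract describes a single loss covering both the image and text branches but does not spell out its terms. The sketch below is one plausible reading, not the authors' implementation: it assumes a noise-prediction mean squared error for the image branch and a cross-entropy over masked tokens for the text branch (a common choice in masked discrete diffusion). The function names, the `text_weight` parameter, and the toy tensors are all hypothetical.

```python
import numpy as np

def image_denoising_loss(pred_noise, true_noise):
    # Assumed image term: MSE between predicted and true noise.
    return float(np.mean((pred_noise - true_noise) ** 2))

def masked_text_loss(logits, target_ids, mask):
    # Assumed text term: cross-entropy averaged over masked positions only.
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability assigned to each ground-truth token.
    token_ll = np.take_along_axis(log_probs, target_ids[..., None], axis=-1)[..., 0]
    n_masked = max(int(mask.sum()), 1)  # avoid division by zero
    return float(-(token_ll * mask).sum() / n_masked)

def joint_loss(pred_noise, true_noise, logits, target_ids, mask, text_weight=1.0):
    # One scalar objective; in a real model its gradient would flow
    # through both the image and the text branch.
    image_term = image_denoising_loss(pred_noise, true_noise)
    text_term = masked_text_loss(logits, target_ids, mask)
    return image_term + text_weight * text_term

# Toy inputs standing in for model outputs (hypothetical shapes).
rng = np.random.default_rng(0)
true_noise = rng.normal(size=(2, 4, 4))
pred_noise = true_noise.copy()                  # perfect noise prediction
logits = np.zeros((2, 5, 8))                    # uniform over a vocabulary of 8
target_ids = rng.integers(0, 8, size=(2, 5))
mask = np.array([[1, 0, 1, 0, 0],
                 [0, 1, 1, 1, 0]], dtype=float)  # 1 = token was masked

loss = joint_loss(pred_noise, true_noise, logits, target_ids, mask)
print(loss)
```

With a perfect noise prediction and uniform logits, the image term vanishes and the text term reduces to the log of the vocabulary size, which makes the toy case easy to check by hand.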
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
