DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu

TL;DR
DiffDis introduces a unified diffusion framework that jointly models image generation and cross-modal discrimination, enhancing both tasks by sharing architecture and leveraging diffusion-based training.
Contribution
The paper proposes DiffDis, a novel dual-stream diffusion model that unifies generative and discriminative cross-modal learning in a single architecture.
Findings
Outperforms single-task models in image synthesis quality, as measured by an improved (lower) FID.
Achieves higher accuracy in zero-shot image classification across multiple datasets.
Demonstrates effective cross-modal semantic alignment in a unified diffusion framework.
Abstract
Recently, large-scale diffusion models, e.g., Stable Diffusion and DALL-E 2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text…
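To make the idea of "a generative diffusion process of the text embedding" more concrete, here is a minimal sketch of the forward (noising) step applied to a text embedding, with a noise-prediction loss. It is an illustration under stated assumptions and not the authors' implementation: the linear beta schedule, the 512-dimensional embedding, the noise-prediction objective, and the names `make_alpha_bar`, `noise_text_embedding`, and `denoising_loss` are all hypothetical choices borrowed from standard DDPM practice. The image-conditioned denoising network itself is not modeled.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_text_embedding(text_emb, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0), where x_0 is the clean text embedding."""
    eps = rng.standard_normal(text_emb.shape)
    x_t = np.sqrt(alpha_bar[t]) * text_emb + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise (assumed objective)."""
    return float(np.mean((predicted_eps - true_eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for a text-encoder output; a real system would use a learned encoder.
text_emb = rng.standard_normal(512)
text_emb /= np.linalg.norm(text_emb)

# Early timesteps keep most of the signal; late timesteps are almost pure noise.
x_early, eps_early = noise_text_embedding(text_emb, 10, alpha_bar, rng)
x_late, eps_late = noise_text_embedding(text_emb, 999, alpha_bar, rng)
```

In a setup like this, a denoising network would receive the noisy embedding, the timestep, and image features, and would be trained to recover the noise (or the clean embedding). How DiffDis conditions on the image and which target it predicts is described in the paper, not here.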
Videos
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: ALIGN · Contrastive Language-Image Pre-training · Diffusion
