DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Runhui Huang; Jianhua Han; Guansong Lu; Xiaodan Liang; Yihan Zeng; Wei Zhang; Hang Xu

arXiv:2308.09306·cs.CV·August 21, 2023

TL;DR

DiffDis introduces a unified diffusion framework that jointly models image generation and cross-modal discrimination, enhancing both tasks by sharing architecture and leveraging diffusion-based training.

Contribution

The paper proposes DiffDis, a novel dual-stream diffusion model that unifies generative and discriminative cross-modal learning in a single architecture.

Findings

1. Outperforms single-task models in image synthesis quality (FID improvement).
2. Achieves higher accuracy in zero-shot image classification across multiple datasets.
3. Demonstrates effective cross-modal semantic alignment in a unified diffusion framework.

Abstract

Recently, large-scale diffusion models, e.g., Stable Diffusion and DALL-E 2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify cross-modal generative and discriminative pretraining in a single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder, conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text…
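
To make the idea of "a generative diffusion process of the text embedding conditioned on the image" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a standard DDPM-style linear noise schedule, and every name in it (`make_alpha_bar`, `noise_embedding`, `toy_denoiser`, the scalar weights) is hypothetical. The toy denoiser stands in for the dual-stream network, and regressing the clean embedding is only one possible training target; the abstract does not say which target the paper uses.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_embedding(text_emb, t, alpha_bar, rng):
    """Forward diffusion step: mix the clean text embedding with Gaussian noise."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in text_emb]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(text_emb, eps)]
    return noisy, eps


def toy_denoiser(noisy_emb, image_feat, w_text, w_image):
    """Hypothetical stand-in for the dual-stream network: an elementwise
    weighted sum of the noisy text embedding and the image feature."""
    return [w_text * n + w_image * f for n, f in zip(noisy_emb, image_feat)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bar = make_alpha_bar()

# Random vectors stand in for real text-encoder and image-latent features.
text_emb = [rng.gauss(0.0, 1.0) for _ in range(16)]
image_feat = [rng.gauss(0.0, 1.0) for _ in range(16)]

t = 500  # an arbitrary diffusion timestep
noisy, eps = noise_embedding(text_emb, t, alpha_bar, rng)

# Assumed objective: predict the clean text embedding from the noisy one,
# conditioned on the image feature.
pred = toy_denoiser(noisy, image_feat, w_text=0.5, w_image=0.5)
loss = mse(pred, text_emb)
```

In a real system the random vectors would be replaced by encoder outputs and the toy denoiser by a learned network trained to minimize the loss over many timesteps; the sketch only shows the data flow the abstract describes.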

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: ALIGN · Contrastive Language-Image Pre-training · Diffusion