UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

Xiao Dong; Runhui Huang; Xiaoyong Wei; Zequn Jie; Jianxing Yu; Jian Yin; Xiaodan Liang

arXiv:2306.00813·cs.CV·June 2, 2023·2 cites

Open Access

TL;DR

UniDiff is a unified multimodal model that combines discriminative and generative learning to improve vision-language tasks, especially on small datasets, by leveraging semantic alignment and consistency without altering the pre-trained architecture.

Contribution

The paper introduces UniDiff, a novel model integrating contrastive, generative, and consistency learning for enhanced multimodal understanding and synthesis during fine-tuning.

Findings

01. Significant improvements in vision-language retrieval accuracy.

02. Enhanced quality in text-to-image generation.

03. Effective mitigation of semantic collapse during fine-tuning.

Abstract

Recent advances in vision-language pre-training have enabled machines to perform better in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation). On the other hand, fine-tuning pre-trained models with discriminative or generative capabilities, such as CLIP and Stable Diffusion, on domain-specific datasets has been shown to be effective in various tasks by adapting to specific domains. However, few studies have explored the possibility of learning both discriminative and generative capabilities and leveraging their synergistic effects to create a powerful and personalized multimodal model during fine-tuning. This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff…
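
To make the image-text contrastive learning (ITC) term concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration under assumptions, not the paper's implementation: the function name `itc_loss`, the temperature value 0.07, and the toy two-dimensional embeddings are all hypothetical, and the abstract does not state the exact loss UniDiff uses.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style sketch).

    Pair k of image_embs and text_embs is treated as the positive match;
    every other pairing in the batch is a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Similarity matrix: rows index images, columns index texts.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each image should pick its own caption.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: matched pairs should score a lower loss than mismatched ones.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while swapping the captions gives a much larger one, which is the behavior a contrastive objective rewards during fine-tuning.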

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Diffusion · Contrastive Language-Image Pre-training · Contrastive Learning