UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning
Xiao Dong, Runhui Huang, Xiaoyong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang

TL;DR
UniDiff is a unified multimodal model that combines discriminative and generative learning to improve vision-language tasks, especially on small datasets, by leveraging semantic alignment and consistency without altering the pre-trained architecture.
Contribution
The paper introduces UniDiff, a novel model integrating contrastive, generative, and consistency learning for enhanced multimodal understanding and synthesis during fine-tuning.
Findings
Significant improvements in vision-language retrieval accuracy.
Enhanced quality in text-to-image generation.
Effective mitigation of semantic collapse during fine-tuning.
Abstract
Recent advances in vision-language pre-training have enabled machines to perform better in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation). On the other hand, fine-tuning pre-trained models with discriminative or generative capabilities, such as CLIP and Stable Diffusion, on domain-specific datasets has been shown to be effective in various tasks by adapting to specific domains. However, few studies have explored the possibility of learning both discriminative and generative capabilities and leveraging their synergistic effects to create a powerful and personalized multimodal model during fine-tuning. This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff…
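To make the image-text contrastive learning (ITC) component named in the abstract concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions, not the UniDiff implementation: the function names (`l2_normalize`, `log_softmax`, `itc_loss`), the temperature value of 0.07, and the use of plain Python lists as embeddings are all hypothetical choices made here, and the IS and RSC objectives are not covered.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row should put its mass on the diagonal entry
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation over the columns
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss: ", itc_loss(images, aligned_texts))
    print("shuffled loss:", itc_loss(images, shuffled_texts))
```

In this toy batch, correctly paired embeddings yield a loss near zero, while swapping the captions produces a much larger loss, which is the signal a contrastive objective of this kind uses to pull matching image-text pairs together.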
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Diffusion · Contrastive Language-Image Pre-training · Contrastive Learning
