un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen

TL;DR
This paper introduces un$^2$CLIP, a method that enhances CLIP's ability to capture detailed visual information by inverting the unCLIP generative model, leading to improved performance across various vision and multimodal tasks.
Contribution
The paper proposes un$^2$CLIP, a novel approach that inverts unCLIP to improve CLIP's ability to capture visual detail while preserving its language alignment.
Findings
un$^2$CLIP significantly outperforms the original CLIP on multiple benchmarks.
The method improves performance on dense-prediction and vision-centric multimodal tasks.
Code and models will be publicly available.
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the…
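One way to picture "inverting unCLIP" is to keep an embedding-conditioned generator frozen and update only the image encoder so that the generator's denoising error drops. The NumPy sketch below is a toy illustration under loud assumptions, not the authors' implementation: the linear "generator" (`A`, `B`, `P`), the linear encoder `W_e`, the single fixed noise level `abar`, and all dimensions are hypothetical stand-ins, and text alignment is not modeled at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 16, 8, 256      # toy "image" dim, embedding dim, batch size (all hypothetical)
abar = 0.5                # one fixed noise level; real diffusion training samples many
s = np.sqrt(1.0 - abar)

# Hypothetical frozen generator: a linear noise predictor conditioned on embedding z.
# It denoises well exactly when P @ z reconstructs the clean input.
P, _ = np.linalg.qr(rng.normal(size=(d, k)))   # (d, k), orthonormal columns
A = np.eye(d) / s
B = -(np.sqrt(abar) / s) * P

def predict_noise(x_t, z):
    """Frozen conditional noise predictor: eps_hat = A x_t + B z."""
    return x_t @ A.T + z @ B.T

# Trainable encoder, initialised so that it discards half of the decodable directions.
W_e = P.T.copy()
W_e[k // 2:] = 0.0

def denoising_loss_and_grad(W, x):
    """Noise-prediction MSE and its gradient with respect to the encoder weights only."""
    eps = rng.normal(size=x.shape)
    x_t = np.sqrt(abar) * x + s * eps          # noised input
    z = x @ W.T                                # embedding used as the condition
    resid = predict_noise(x_t, z) - eps
    loss = np.mean(resid ** 2)
    grad_z = (2.0 / resid.size) * resid @ B    # backprop through the frozen generator
    grad_W = grad_z.T @ x                      # backprop through the linear encoder
    return loss, grad_W

x = rng.normal(size=(n, d))
A0, B0 = A.copy(), B.copy()                    # snapshots to confirm the generator stays frozen
loss_before, _ = denoising_loss_and_grad(W_e, x)
for _ in range(200):
    _, g = denoising_loss_and_grad(W_e, x)
    W_e -= 1.0 * g                             # plain gradient descent on the encoder
loss_after, _ = denoising_loss_and_grad(W_e, x)
print(loss_before, loss_after)
```

In this toy setup the only way to lower the loss is for the embedding to retain more of the input, which mirrors the intuition that a generative objective pushes the encoder toward capturing detail.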
Taxonomy
Topics: Multimodal Machine Learning Applications · Open Education and E-Learning · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
