un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li; Jiahe Zhao; Hong Chang; Ruibing Hou; Shiguang Shan; Xilin Chen

arXiv:2505.24517·cs.CV·June 2, 2025

TL;DR

This paper introduces un$^2$CLIP, a method that enhances CLIP's ability to capture detailed visual information by inverting the unCLIP generative model, leading to improved performance across various vision and multimodal tasks.

Contribution

The paper proposes un$^2$CLIP, a novel approach that inverts unCLIP to improve CLIP's visual detail capturing while maintaining its language alignment.

Findings

01 un$^2$CLIP significantly outperforms the original CLIP on multiple benchmarks.

02 The method improves performance on dense-prediction and multimodal tasks.

03 Code and models will be publicly available.

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the…
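
To make the "inverting unCLIP" idea concrete, here is a toy sketch in plain Python. It is not the paper's training code: the abstract above is cut off before it states the objective, so the squared reconstruction loss, the linear encoder `W`, the linear generator `G`, and every name below are illustrative assumptions. The sketch only shows the direction of the inversion. unCLIP trains a generator on top of a fixed encoder; here the generator is held fixed and only the encoder is updated, so that its embedding keeps more of what the generator needs to rebuild the image. Language alignment is not modelled.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def recon_loss(W, G, images):
    """Mean squared error between each image x and its reconstruction G(W x)."""
    total = 0.0
    for x in images:
        x_hat = matvec(G, matvec(W, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(images)

def finetune_encoder(W, G, images, lr=0.05, steps=200):
    """Gradient descent on the encoder W only; the generator G is never updated."""
    W = [row[:] for row in W]  # work on a copy so the starting encoder is kept
    n = len(images)
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x in images:
            x_hat = matvec(G, matvec(W, x))
            r = [a - b for a, b in zip(x_hat, x)]  # reconstruction residual
            # Gradient of ||G W x - x||^2 with respect to W is 2 * G^T r x^T.
            Gt_r = [sum(G[i][k] * r[i] for i in range(len(G)))
                    for k in range(len(W))]
            for k in range(len(W)):
                for j in range(len(x)):
                    grad[k][j] += 2.0 * Gt_r[k] * x[j] / n
        for k in range(len(W)):
            for j in range(len(W[0])):
                W[k][j] -= lr * grad[k][j]
    return W

# Hypothetical toy data: 4-dimensional "images", 2-dimensional "embeddings".
random.seed(0)
images = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
G = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, -0.5]]  # frozen generator (assumed)
W0 = [[random.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(2)]  # starting encoder

W1 = finetune_encoder(W0, G, images)
before = recon_loss(W0, G, images)
after = recon_loss(W1, G, images)
print(f"reconstruction loss before: {before:.4f}, after: {after:.4f}")
```

Because the embedding has fewer dimensions than the image, the loss cannot reach zero in this toy setup. The point is only that updating the encoder against a fixed generator pushes the embedding toward retaining more image detail.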

Code & Models

Repositories

liyinqi/un2clip (PyTorch, official)

Models

yinqi/un2CLIP (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Open Education and E-Learning · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training