Diffusion Feedback Helps CLIP See Better

Wenxuan Wang; Quan Sun; Fan Zhang; Yepeng Tang; Jing Liu; Xinlong Wang

arXiv:2407.20171 · cs.CV · August 27, 2024 · 1 citation

TL;DR

This paper introduces DIVA, a diffusion-based post-training method that enhances CLIP's visual perception, significantly improving its fine-grained visual understanding without sacrificing its zero-shot capabilities.

Contribution

The paper presents a novel diffusion feedback approach, DIVA, that improves CLIP's visual recognition and segmentation abilities through self-supervised generative feedback.

Findings

01. Improves CLIP's performance on the MMVP-VLM benchmark by 3-7%.

02. Enhances multimodal understanding and segmentation tasks.

03. Maintains strong zero-shot capabilities across 29 benchmarks.

Abstract

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to a lack of distinctiveness in the text and of diversity in the images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for…
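
The abstract describes the feedback loop only at a high level. The following is a minimal sketch of the general idea, not the DIVA implementation. It assumes the encoder's visual features are fed as the condition to a frozen noise predictor, and that the denoising loss, averaged over several randomly sampled diffusion states per image, updates only the encoder. A toy linear map `W` stands in for CLIP's visual encoder and a toy linear predictor (`A`, `B`) stands in for a pre-trained diffusion model. All function names are hypothetical, and gradients are derived by hand so the example needs only the standard library.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def frozen_denoiser(x_t, cond, A, B):
    """Hypothetical frozen noise predictor: eps_hat = A @ x_t + B @ cond."""
    return [p + q for p, q in zip(matvec(A, x_t), matvec(B, cond))]


def sample_states(dim, n_states, rng):
    """Draw several random (noise level, noise vector) pairs for one image."""
    return [
        (rng.uniform(0.1, 0.9), [rng.gauss(0.0, 1.0) for _ in range(dim)])
        for _ in range(n_states)
    ]


def loss_and_grad(W, x0, A, B, states):
    """Mean denoising loss over the sampled states and its gradient w.r.t. W."""
    cond = matvec(W, x0)  # toy "encoder": visual features = W @ x0
    grad_cond = [0.0] * len(cond)
    loss = 0.0
    n = len(states)
    for alpha_bar, eps in states:
        x_t = add_noise(x0, eps, alpha_bar)
        eps_hat = frozen_denoiser(x_t, cond, A, B)
        resid = [h - e for h, e in zip(eps_hat, eps)]
        loss += sum(r * r for r in resid) / n
        # d(loss)/d(cond) = 2 * B^T @ resid, averaged over the states
        for j in range(len(cond)):
            grad_cond[j] += 2.0 * sum(B[i][j] * resid[i] for i in range(len(resid))) / n
    # Chain rule through cond = W @ x0: dL/dW[j][k] = grad_cond[j] * x0[k]
    grad_W = [[g * x for x in x0] for g in grad_cond]
    return loss, grad_W


def feedback_step(W, x0, A, B, states, lr=0.1):
    """One gradient step on the encoder weights only; A and B are never updated."""
    loss, grad_W = loss_and_grad(W, x0, A, B, states)
    new_W = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad_W)
    ]
    return new_W, loss
```

In this toy setting, calling `feedback_step` repeatedly with fresh output from `sample_states` plays the role of post-training: no text or labels are involved, and the only learning signal is how well the frozen predictor can denoise the image given the encoder's features.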

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The improvement on the MMVP-VLM benchmark is significant.
3. The experiment is well designed and comprehensive.

Weaknesses

1. The hypothesis of this paper is not well explained, and the solution is somewhat ad hoc. In detail, the failure of CLIP on the typical cases mentioned in the paper should be studied and analyzed more thoroughly, rather than just citing a few papers and mentioning them very briefly. Whether the failure cases are caused by the architecture design, the training strategy, or simply a lack of training data is unknown. Only given such an analysis can we start to think of a solution. On the contrary, the paper

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Learning a good discriminative representation from an image generation loss is not new, but I think it still has great potential and should be studied. In this regard, the proposed idea of fine-tuning a pre-trained discriminative model (CLIP) via a pre-trained generative model (Diffusion) is interesting and demonstrates its potential.
- The authors evaluate DIVA across various benchmarks, including multimodal understanding such as MMVP-VLM, the backbone of LLaVA, zero-shot classification, and se

Weaknesses

- Overall, this manuscript lacks detailed motivations and explanations. It is unclear why this approach is needed: for example, why does CLIP need diffusion feedback, and how does it help? Why is this particular approach superior to others?
- The training process involves sampling multiple random states of the diffusion process for each image, which can be computationally expensive. There is insufficient discussion on whether this cost is justified and whether the gains are significant enough under th

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper follows a clear intuition on how to give the CLIP model better visual understanding. Using a diffusion model to improve the CLIP model is overall a nice idea.
2. The paper presents a method that uses both the global tokens and the local tokens to improve the visual encoder. The paper also finds that the percentage of local tokens used is important and conducts experiments to validate that.
3. The paper analyzes the method on multiple tasks and thus the improvement is bet

Weaknesses

- The clarity of the method description needs to be improved. Usually, the presentation (i.e., writing) of a paper would not be directly treated as a weakness, and some such issues are put in the suggestions below. However, the ambiguity in the current method description does affect the reader's understanding of the method and the assessment of the paper during review. Thus the reviewer lists it as a weakness to highlight it, in the hope that it can be improved during rebuttal. I list some

Code & Models

Repositories

baaivision/diva
PyTorch · Official

Models

🤗 BAAI/DIVA
model · ♡ 8

Videos

Diffusion Feedback Helps CLIP See Better · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Contrastive Language-Image Pre-training