TL;DR
This paper introduces DIVA, a diffusion-based post-training method that enhances CLIP's visual perception, significantly improving its fine-grained visual understanding without sacrificing its zero-shot capabilities.
Contribution
The paper presents a novel diffusion feedback approach, DIVA, that improves CLIP's visual recognition and segmentation abilities through self-supervised generative feedback.
Findings
Improves CLIP's performance on the MMVP-VLM benchmark by 3-7%.
Improves performance on multimodal understanding and segmentation tasks.
Maintains strong zero-shot capabilities across 29 benchmarks.
Abstract
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, or structure. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, owing to a lack of distinctiveness in the text and of diversity in the images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for…
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper is well written and easy to follow. 2. The improvement on the MMVP-VLM benchmark is significant. 3. The experiments are well designed and comprehensive.
1. The hypothesis of this paper is not well explained, and the solution is somewhat ad hoc. In detail, the failures of CLIP on the typical cases mentioned in the paper should be studied and analyzed more thoroughly, rather than covered by citing a few papers and mentioning them very briefly. It is unknown whether the failure cases are caused by the architecture design, the training strategy, or simply a lack of training data. Only with such an analysis can we start to think of a solution. On the contrary, the paper…
- Learning a good discriminative representation from an image generation loss is not new, but I think it still has great potential and should be studied. In this regard, the proposed idea of fine-tuning a pre-trained discriminative model (CLIP) via a pre-trained generative model (Diffusion) is interesting and demonstrates its potential. - The authors evaluate DIVA across various benchmarks, including multimodal understanding such as MMVP-VLM, the backbone of LLaVA, zero-shot classification, and se…
- Overall, this manuscript lacks detailed motivations and explanations. It is unclear why this approach is needed: for example, why does CLIP need diffusion feedback, and how does it help? Why is this particular approach superior to others? - The training process involves sampling multiple random states of the diffusion process for each image, which can be computationally expensive. There is insufficient discussion of whether this cost is justified and whether the gains are significant enough under th…
1. The paper follows a clear intuition on how to improve the CLIP model with better visual understanding. Using a diffusion model to improve the CLIP model is overall a nice idea. 2. The paper presents a method that uses both the global tokens and the local tokens to improve the visual encoder. The paper also finds that the percentage of local tokens used is important and conducts experiments to validate that. 3. The paper analyzes the method on multiple tasks and thus the improvement is bet…
Weaknesses: - The clarity of the method description needs to be improved. Usually, the presentation (i.e., the writing) of a paper would not be treated directly as a weakness, and some of these points are put in the suggestions below. However, the ambiguity in the current method description does affect the reader's understanding of the method and the assessment of the paper during review. The reviewer therefore lists it as a weakness to highlight it, in the hope that it can be improved during the rebuttal. I list some…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · Contrastive Language-Image Pre-training
