TL;DR
ProLIP is a probabilistic vision-language model pre-trained on a billion-scale image-text dataset. It estimates uncertainty and improves zero-shot and few-shot image classification performance.
Contribution
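ProLIP is the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives. It estimates uncertainty through an "uncertainty token" and adds a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs.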
Findings
Achieves 74.6% ImageNet zero-shot accuracy with ViT-B/16.
Improves ImageNet accuracy to 75.8% using text uncertainties in few-shot settings.
Estimates uncertainty through an "uncertainty token" ([UNC]) without extra parameters.
Abstract
Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by…
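To make the idea of a probabilistic embedding concrete, here is a minimal sketch. It assumes each input is represented as a diagonal Gaussian (a mean vector plus per-dimension variances) and compares two inputs by the expected squared Euclidean distance between independent samples, which has the closed form ||mu_a - mu_b||^2 + sum(var_a) + sum(var_b). The names `GaussianEmbedding` and `expected_sq_distance` are hypothetical, the numbers are made up for illustration, and this is not claimed to be ProLIP's actual training objective or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianEmbedding:
    """Hypothetical diagonal-Gaussian embedding: mean plus per-dimension variance."""
    mean: List[float]
    var: List[float]

    def uncertainty(self) -> float:
        # Scalar summary of uncertainty: the sum of variances (an illustrative choice).
        return sum(self.var)


def expected_sq_distance(a: GaussianEmbedding, b: GaussianEmbedding) -> float:
    """E||X - Y||^2 for independent X ~ N(a.mean, diag(a.var)), Y ~ N(b.mean, diag(b.var))."""
    mean_term = sum((x - y) ** 2 for x, y in zip(a.mean, b.mean))
    var_term = sum(a.var) + sum(b.var)
    return mean_term + var_term


# Illustrative values only: an image, a specific caption, and a vague caption.
image = GaussianEmbedding(mean=[1.0, 0.0], var=[0.01, 0.01])
specific_caption = GaussianEmbedding(mean=[1.0, 0.0], var=[0.02, 0.02])
vague_caption = GaussianEmbedding(mean=[1.0, 0.0], var=[0.5, 0.5])

d_specific = expected_sq_distance(image, specific_caption)
d_vague = expected_sq_distance(image, vague_caption)
print(d_specific < d_vague)  # the vaguer caption is farther in expectation
```

Even with identical means, the caption with larger variance is farther from the image in expectation. This shows how a distributional representation can carry information that a single deterministic vector cannot.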
Peer Reviews
Decision: ICLR 2025 Poster
1. Originality
* The paper introduces a novel Probabilistic Vision-Language Model (PrVLM), named ProLIP, which represents a shift from deterministic vision-language models to probabilistic embeddings. This approach innovatively captures the natural ambiguity in image-text relationships, where multiple captions may describe a single image, and a single caption may match multiple images.
* The concept of an uncertainty token ([UNC]) is a creative contribution, allowing the model to quantify uncert
1. Interpretability and Visualization:
* Although the inclusion loss aligns with human intuitions, the visualization methods could be improved. For example, it would be interesting to see more direct comparisons with non-probabilistic models in terms of how well ProLIP aligns visual and textual hierarchies in practice.
2. Limited Exploration of Task-Specific Uncertainties:
* The paper briefly touches on using uncertainty for image-to-text retrieval but could expand by exploring how uncertai
- This paper is well-motivated, starting from the many-to-many matching relationships within a batch of images and texts. It is also well-structured and easy to follow.
- The authors provided strong mathematical support for the proposed learning objective.
- Extensive experiments were conducted to demonstrate the effectiveness of the proposed method.
- The proposed method has been proven effective on datasets containing billions of image-text pairs.
- The proposed method was trained only on ViT-B and ViT-L vision encoders. Scaling up to ViT-H and comparing with other methods are important to further demonstrate the scalability of ProLIP.
- As shown in Table 1, ProLIP introduces a more complex loss design and training process compared to CLIP, yet it achieves only a 0.9% improvement in ImageNet zero-shot classification and an average gain of 0.7%. In Table C.1, ProLIP with ViT-L likewise exhibits only a slight improvement over CLIP. This mod
1. The paper constructs a probabilistic VLM that can capture uncertainty from the multimodal dataset.
2. The paper proposes the inclusion loss and also provides an intuitive understanding of it.
3. The paper contains an interesting analysis of the uncertainty.
4. The paper proposes utilizing the uncertainty to conduct prompt rewriting.
1. The intuition behind the paper is that multimodal paired data can actually be a many-to-many mapping. I think this intuition would naturally lead to a multi-mode representation, since different texts can be assigned to a single image, e.g., "A dog is walking" and "A man is looking at the building". However, the model in this paper is a Gaussian distribution with a single mean, which cannot provide good mode coverage.
2. (Minor) The basic idea behind contrastive learning is that the positiv
Code & Models
- 🤗 SanghyukChun/ProLIP-ViT-B-16-DC-1B-12_8B (model, 99 downloads)
- 🤗 SanghyukChun/ProLIP-ViT-L-16-FT-DC-1B-1_28B (model, 2 downloads, 2 likes)
- 🤗 SanghyukChun/ProLIP-ViT-SO400M-14-FT-DC-1B-1_28B (model, 10 downloads, 2 likes)
- 🤗 SanghyukChun/ProLIP-ViT-H-14-FT-DC-1B-1_28B (model, 5 downloads, 1 like)
- 🤗 SanghyukChun/LongProLIP-ViT-B-16-S128M (model)
- 🤗 SanghyukChun/LongProLIP-ViT-B-16-SHD128M (model, 1 download)
- 🤗 SanghyukChun/LongProLIP-ViT-B-16-S24M (model, 1 download)
Taxonomy
Topics: Natural Language Processing Techniques
