Improved Probabilistic Image-Text Representations
Sanghyuk Chun

TL;DR
This paper introduces PCME++, an improved probabilistic embedding method for image-text matching that reduces computational complexity, addresses false negatives, and enhances robustness, outperforming existing methods on multiple benchmarks.
Contribution
The paper proposes a novel probabilistic distance with a closed-form solution and two optimization techniques, advancing probabilistic image-text representations.
Findings
PCME++ outperforms state-of-the-art ITM methods on the MS-COCO, CxC, and ECCV Caption benchmarks.
The method demonstrates robustness under noisy image-text correspondences.
The method has a potential application in zero-shot classification via automatic prompt filtering.
Abstract
The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture ambiguity, prompting the exploration of probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach encounters two key shortcomings: the burden of heavy computations due to the Monte Carlo approximation, and the loss saturation issue in the face of abundant false negatives. To overcome these issues, this paper presents an improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent the negative effect under massive false…
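To illustrate why a closed-form distance removes the need for Monte Carlo sampling, the sketch below compares the two on toy data. It assumes each embedding is an independent Gaussian with diagonal covariance, for which the expected squared Euclidean distance is the squared distance between the means plus the sum of all variances. This is a standard identity used here for illustration, not necessarily the exact formulation in the paper; the function names and toy values are hypothetical.

```python
import random


def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """Closed-form E||Za - Zb||^2 for independent diagonal Gaussians.

    Assumption (not taken from the paper text above): Za ~ N(mu_a, diag(var_a))
    and Zb ~ N(mu_b, diag(var_b)) are independent.
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def monte_carlo_sq_distance(mu_a, var_a, mu_b, var_b, n_samples, rng):
    """Estimate the same expectation by drawing sample pairs."""
    total = 0.0
    for _ in range(n_samples):
        sq = 0.0
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
            za = rng.gauss(ma, va ** 0.5)  # gauss takes a standard deviation
            zb = rng.gauss(mb, vb ** 0.5)
            sq += (za - zb) ** 2
        total += sq
    return total / n_samples


# Toy "image" and "text" embeddings (hypothetical values).
rng = random.Random(0)
dim = 4
mu_img = [rng.uniform(-1, 1) for _ in range(dim)]
mu_txt = [rng.uniform(-1, 1) for _ in range(dim)]
var_img = [rng.uniform(0.1, 0.5) for _ in range(dim)]
var_txt = [rng.uniform(0.1, 0.5) for _ in range(dim)]

closed = expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
sampled = monte_carlo_sq_distance(mu_img, var_img, mu_txt, var_txt, 20000, rng)
print(f"closed form: {closed:.4f}  Monte Carlo estimate: {sampled:.4f}")
```

The closed form costs one pass over the embedding dimensions, while the sampled estimate needs many draws per pair to get close to it. Under this formula, two identical distributions still have a non-zero distance equal to twice the sum of their variances, so the variance term acts as an uncertainty penalty.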
Peer Reviews
Decision·ICLR 2024 poster
1. The presentation is good and easy to follow. 2. The robustness of PCME++ is evaluated under noisy image-text correspondences. 3. Most of the figures in this paper are informative, especially Figure 1 and Figure 5.
1. This work is deeply coupled with PCME (Chun et al., 2021), which might limit its inspiration and extensibility for other work. 2. As shown in Figure 5, the visualization shows the uncertainty of the learned embeddings of visual and textual features. However, the proposed closed-form sampled distance (CSD) can only measure the uncertainty, not the area of overlap between the two modalities. I hope the authors can provide more analysis. 3. Some techniques such as Mixup
- As its major motivation, this paper analyzes the many-to-many nature and other inherent issues of the Image-Text Matching task. Focusing on this fundamental Vision Language (VL) downstream task, this work addresses shortcomings of previous uncertainty-based methods such as PCME. - The presentation of this work is clear and easy to follow. In particular, the figures, such as Figure 2, are informative in illustrating important contexts. - The authors provide solid experimental resu
- Regarding the advantages of PCME++, the authors mention its benefit when scaling up backbones, but this point lacks further discussion. (See Questions)
- Well organized and easy to follow. - The paper focuses on two important issues in image-text matching: many-to-many matching and sparse annotations.
- As shown in Tables 1 and 2, when ViT-B/32 is chosen as the backbone, the performance of PCME++ is inferior to VSE∞. Additionally, the performances of P2RM and DAA published in their original papers actually exceed PCME by a large margin, while their performances presented here are lower than PCME. The authors should provide the necessary explanations on these issues. Most importantly, could the proposed PP and MSDA be applied to P2RM and DAA to enhance the many-to-many performance? If not, ple
Taxonomy
Topics: Image Retrieval and Classification Techniques · Video Analysis and Summarization · Handwritten Text Recognition Techniques
