Probabilistic Embeddings for Cross-Modal Retrieval
Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus

TL;DR
This paper introduces Probabilistic Cross-Modal Embedding (PCME), which represents images and captions as probability distributions in a shared embedding space, rather than as single points, to better handle one-to-many correspondences and improve retrieval performance.
Contribution
The paper proposes PCME, a probabilistic embedding method for cross-modal retrieval that captures uncertainty and improves over deterministic models, with comprehensive ablation studies.
Findings
PCME outperforms deterministic models in retrieval tasks.
It provides meaningful uncertainty estimates for embeddings.
Evaluation on COCO and CUB datasets demonstrates improved performance.
Abstract
Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particularly challenging. Given an image (respectively a caption), there are multiple captions (respectively images) that equally make sense. In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences. Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space. Since common benchmarks such as COCO suffer from non-exhaustive annotations for cross-modal matches, we propose to additionally evaluate retrieval on the CUB dataset, a smaller yet clean…
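To make the idea of distribution-valued embeddings concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image or caption is embedded as a Gaussian with a diagonal covariance, and that a soft match score is estimated by sampling from both distributions and averaging a sigmoid of the negative Euclidean distance. The function names and the `scale` and `shift` parameters are hypothetical and do not come from the text above.

```python
import math
import random


def sample_gaussian(mean, std, n, rng):
    """Draw n points from a diagonal Gaussian N(mean, diag(std^2))."""
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n)]


def match_probability(emb_a, emb_b, n_samples=8, scale=1.0, shift=0.0, seed=0):
    """Monte Carlo estimate of a soft match score between two embeddings.

    Each embedding is a (mean, std) pair of equal-length lists.
    scale and shift are hypothetical parameters of the sigmoid
    sigmoid(-scale * distance + shift); they are illustration-only assumptions.
    """
    rng = random.Random(seed)
    xs = sample_gaussian(emb_a[0], emb_a[1], n_samples, rng)
    ys = sample_gaussian(emb_b[0], emb_b[1], n_samples, rng)
    total = 0.0
    for x in xs:
        for y in ys:
            dist = math.dist(x, y)  # Euclidean distance between two samples
            total += 1.0 / (1.0 + math.exp(scale * dist - shift))
    return total / (n_samples * n_samples)


# Toy 2-D embeddings: (mean, per-dimension standard deviation).
image = ([0.0, 0.0], [0.1, 0.1])
caption_near = ([0.1, 0.0], [0.1, 0.1])
caption_far = ([3.0, 3.0], [0.1, 0.1])

p_near = match_probability(image, caption_near)
p_far = match_probability(image, caption_far)
print(p_near > p_far)
```

In this toy setup the caption whose distribution sits close to the image's receives the higher score. A larger standard deviation would spread an embedding's samples over a wider region, which is one way a single image could stay plausibly matched to several different captions.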
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
