Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu, Luis A. Leiva

TL;DR
This paper explores how to align visual and textual representations for cross-modal retrieval, showing that cosine similarity outperforms the other similarity metrics evaluated and highlighting the limitations of simple neural architectures.
Contribution
The study analyzes the geometric relationships between modalities and evaluates multiple similarity metrics, offering insights for improving multimodal retrieval systems.
Findings
Cosine similarity outperforms other metrics in feature alignment.
Wasserstein distance effectively measures the modality gap.
Multilayer perceptrons are insufficient for complex cross-modal interactions.
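As a rough illustration of how a Wasserstein distance can quantify a gap between two sets of features, the sketch below computes the 1-D Wasserstein-1 distance between two equal-size samples, which reduces to the mean absolute difference of the sorted values. This is only a simplified stand-in: the function name `wasserstein_1d` and the toy feature values are assumptions for illustration, and the paper's actual computation over embeddings is not described in this summary.

```python
def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size empirical samples.

    With uniform weights and equal sample sizes, the optimal transport plan
    matches sorted values, so the distance is the mean absolute difference
    of the sorted samples.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


# Hypothetical scalar features (e.g. one embedding dimension) per modality.
image_feature = [0.1, 0.2, 0.3]
text_feature = [1.3, 1.1, 1.2]

gap = wasserstein_1d(image_feature, text_feature)
same = wasserstein_1d(image_feature, image_feature)
```

A larger value indicates that the two samples are distributed further apart; identical samples give a distance of zero.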
Abstract
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can…
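The retrieval setup described in the abstract (given a text embedding, return the best-matching image embedding) can be sketched with cosine similarity as the alignment score. This is a minimal sketch under stated assumptions: the embeddings are toy vectors rather than outputs of real encoders, and the names `cosine_similarity` and `retrieve` are hypothetical helpers, not the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query, plus all scores."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings standing in for a text encoder and an image encoder output.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best_index, scores = retrieve(text_embedding, image_embeddings)
```

Swapping the roles of the query and the candidates gives the image-to-text direction.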
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
