SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval
Minyoung Kim

TL;DR
This paper introduces SwAMP, a novel cross-modal retrieval method that uses self-labeled swapped class assignments to improve semantic alignment across modalities, outperforming traditional contrastive learning.
Contribution
The paper proposes a new loss function based on self-labeling and swapping class labels to enhance cross-modal retrieval performance.
Findings
Significant improvement over contrastive learning in multiple retrieval tasks
Effective semantic alignment across modalities using swapped pseudo labels
Applicable to text-video, sketch-image, and image-text retrieval
Abstract
We tackle the cross-modal retrieval problem, where learning is supervised only by relevant multi-modal pairs in the data. Although contrastive learning is the most popular approach for this task, it makes the potentially wrong assumption that instances in different pairs are automatically irrelevant. To address this issue, we propose a novel loss function based on self-labeling of the unknown semantic classes. Specifically, we aim to predict class labels of the data instances in each modality, and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We tested our approach on…
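The swapped-label idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `swapped_assignment_loss`, the linear prototype scoring, the `temperature` parameter, and the hard argmax used to produce pseudo labels are all assumptions made for illustration, since the abstract does not say how the class labels are predicted.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def swapped_assignment_loss(emb_a, emb_b, prototypes, temperature=0.1):
    """Cross-entropy with pseudo labels swapped between two modalities.

    emb_a, emb_b: (N, D) embeddings of N paired instances (row i of emb_a
                  is paired with row i of emb_b).
    prototypes:   (K, D) class vectors shared by both modalities
                  (hypothetical; stands in for the class predictor).
    """
    # Class probabilities predicted independently in each modality.
    probs_a = softmax(emb_a @ prototypes.T / temperature)
    probs_b = softmax(emb_b @ prototypes.T / temperature)

    # Pseudo labels via hard argmax (an assumption of this sketch).
    labels_a = probs_a.argmax(axis=1)
    labels_b = probs_b.argmax(axis=1)

    # Swap: modality A is trained on B's labels, and vice versa.
    rows = np.arange(emb_a.shape[0])
    eps = 1e-12  # guards against log(0)
    loss_a = -np.log(probs_a[rows, labels_b] + eps).mean()
    loss_b = -np.log(probs_b[rows, labels_a] + eps).mean()
    return 0.5 * (loss_a + loss_b)


# Usage with random stand-in embeddings (8 pairs, 16 dims, 4 classes).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 16))
emb_b = emb_a + 0.01 * rng.normal(size=(8, 16))
prototypes = rng.normal(size=(4, 16))
loss = swapped_assignment_loss(emb_a, emb_b, prototypes)
```

Unlike a contrastive loss, nothing here pushes instances from different pairs apart: two instances from different pairs that receive the same pseudo label are pulled toward the same class. Note that plain argmax labeling can collapse all instances into a single class, so a practical method would need some mechanism to prevent that; this sketch does not include one.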
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
