SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval
Wenjie Yang, Hang Yu, Yuyu Guo, Peng Di

TL;DR
This paper introduces SOLAR, a self-supervised framework for symmetric multimodal retrieval that leverages unlabeled web data and outperforms supervised models with fewer parameters.
Contribution
The authors propose a novel two-stage self-supervised learning method for symmetric multimodal retrieval, along with a new benchmark for evaluation.
Findings
SOLAR surpasses the strongest supervised VLM by 7.08 points on the new benchmark.
It achieves this with over 50x fewer parameters and a 5x smaller embedding dimension.
Extensive experiments demonstrate its effectiveness over ten state-of-the-art methods.
Abstract
In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage, we learn the intersection mask of an image-text pair, allowing us to align the intersection while preserving the semantics of the difference. In the second stage, the learned mask is further utilized to construct positive and hard-negative samples via masking different parts of…
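The symmetric setting named in the abstract, where any item can act as either query or context, can be illustrated with a minimal sketch. This is not SOLAR's method: the joint embeddings below are hypothetical hand-picked vectors standing in for the output of some image-text encoder, and the names `cosine`, `retrieve`, and `nearest` are illustrative assumptions. The only point shown is that retrieval with a symmetric similarity over one shared embedding space treats both roles identically.

```python
import math

def cosine(u, v):
    """Cosine similarity; symmetric in its two arguments."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, corpus, k=1):
    """Return the k corpus keys whose embeddings are most similar to the query."""
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]

# Hypothetical joint embeddings of (image, text) documents.
# In a real system these would come from a trained multimodal encoder.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}

def nearest(name):
    """Use one document as the query and search the remaining documents."""
    others = {key: vec for key, vec in docs.items() if key != name}
    return retrieve(docs[name], others)[0]

# Either member of a related pair can serve as the query.
print(nearest("doc_a"), nearest("doc_b"))
```

Because the same embedding function and the same similarity are used for both sides, swapping the roles of query and context does not change the score for a pair, which is the property an asymmetric query-to-document setup does not guarantee.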
