SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval

Wenjie Yang; Hang Yu; Yuyu Guo; Peng Di

arXiv:2605.15868 · cs.CV · May 18, 2026

TL;DR

This paper introduces SOLAR, a self-supervised framework for symmetric multimodal retrieval that leverages unlabeled web data and outperforms supervised models with fewer parameters.

Contribution

The authors propose a novel two-stage self-supervised learning method for symmetric multimodal retrieval, along with a new benchmark for evaluation.

Findings

01. SOLAR surpasses the strongest supervised VLM by 7.08 points on the new benchmark.

02. It achieves this with over 50x fewer parameters and a 5x smaller embedding dimension.

03. Extensive experiments demonstrate its effectiveness over ten state-of-the-art methods.

Abstract

In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval methods struggle with this task because they are constrained by the labeled asymmetric datasets they use. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, allowing us to align the intersection while preserving the semantics of the difference. In the second stage, the learned mask is further used to construct positive and hard-negative samples via masking different parts of…
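
To make the second-stage idea more concrete, here is a toy sketch of how an intersection mask could yield one positive and one hard negative for a contrastive loss. It is not the authors' implementation. The abstract is cut off before it says which parts are masked, so the masking choices below are assumptions: the positive drops one non-intersection token and the hard negative drops the intersection tokens. The InfoNCE-style loss, mean pooling, temperature, and all names (`masked_mean`, `info_nce`, `intersection`) are also hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_mean(tokens, keep):
    """Mean-pool the token vectors whose keep flag is 1."""
    kept = [t for t, k in zip(tokens, keep) if k]
    dim = len(tokens[0])
    return [sum(t[d] for t in kept) / len(kept) for d in range(dim)]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy token features for one image-text pair (hypothetical values).
tokens = [
    [1.0, 0.0, 0.0],  # content shared by image and text
    [0.0, 1.0, 0.0],  # content shared by image and text
    [0.0, 0.0, 1.0],  # content present in only one modality
    [0.0, 0.0, 1.0],  # content present in only one modality
]

# Stage-one output in this sketch: 1 marks tokens in the image-text intersection.
intersection = [1, 1, 0, 0]

# Anchor: the full, unmasked sample.
anchor = masked_mean(tokens, [1, 1, 1, 1])

# ASSUMPTION: the positive keeps the intersection and drops one other token.
positive = masked_mean(tokens, [1, 1, 1, 0])

# ASSUMPTION: the hard negative masks out the intersection itself.
hard_negative = masked_mean(tokens, [1 - m for m in intersection])

loss = info_nce(anchor, positive, [hard_negative])
print("sim(anchor, positive)      =", round(cosine(anchor, positive), 4))
print("sim(anchor, hard_negative) =", round(cosine(anchor, hard_negative), 4))
print("loss =", round(loss, 4))
```

In this toy setup the hard negative stays fairly close to the anchor because it keeps the modality-specific tokens, which is what makes it "hard", while the positive stays closer still because it retains the shared content.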
