Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval

Jonghyun Song; Youngjune Lee; Gyu-Hwung Cho; Ilhyeon Song; Saehun Kim; Yohan Jo

arXiv:2508.16707 · cs.CL · August 26, 2025

TL;DR

This paper introduces a joint training framework for text-image retrieval that combines sparse and dense representations via self-knowledge distillation, improving performance and efficiency over existing methods.

Contribution

The paper proposes a bi-directional learning approach for sparse and dense retrieval models that uses shared similarity scores and fine-tuning to improve multimodal retrieval performance.

Findings

01 Outperforms existing sparse retrieval baselines.

02 Achieves comparable or better performance than dense models.

03 Retains efficiency and interpretability of sparse models.

Abstract

Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training or distillation from a frozen dense model, which limits the potential for mutual enhancement. To address these limitations, we propose a simple yet effective framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score, a weighted sum of dense…
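To make the integrated similarity score concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract is cut off after "a weighted sum of dense…", so this sketch assumes the second term is the sparse similarity and that the combined score acts as a shared teacher for both branches. The function names, the `weight` parameter, and the choice of KL divergence as the distillation loss are all hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def integrated_similarity(dense_sim, sparse_sim, weight=0.5):
    """Weighted sum of two similarity matrices (assumed form of the shared score).

    `weight` is a hypothetical mixing coefficient; the source does not state one.
    """
    return weight * dense_sim + (1.0 - weight) * sparse_sim


def kl_to_teacher(teacher_sim, student_sim):
    """Mean KL(teacher || student) over rows, i.e. over queries in the batch."""
    log_p = log_softmax(teacher_sim)
    log_q = log_softmax(student_sim)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))


def self_distillation_loss(dense_sim, sparse_sim, weight=0.5):
    """Both branches are pulled toward the same integrated score.

    In a real training loop the teacher would normally be treated as a
    constant (stop-gradient); plain NumPy has no gradients, so that detail
    is only noted here.
    """
    teacher = integrated_similarity(dense_sim, sparse_sim, weight)
    return kl_to_teacher(teacher, dense_sim) + kl_to_teacher(teacher, sparse_sim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy batch: 4 text queries and 4 images.
    text_dense = rng.normal(size=(4, 8))
    image_dense = rng.normal(size=(4, 8))
    # Sparse vectors are non-negative term weights over a toy 16-term vocabulary.
    text_sparse = np.maximum(rng.normal(size=(4, 16)), 0.0)
    image_sparse = np.maximum(rng.normal(size=(4, 16)), 0.0)

    dense_sim = text_dense @ image_dense.T
    sparse_sim = text_sparse @ image_sparse.T
    print("distillation loss:", self_distillation_loss(dense_sim, sparse_sim))
```

Dense and sparse dot products can live on very different scales, so a practical version would likely normalize or temperature-scale each similarity matrix before mixing; that choice is left open here because the truncated abstract does not specify it.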
