Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang, Wenjuan Xi, Luping Zhou, Jinhui Tang

TL;DR
This paper addresses the challenge of modal imbalance in vision-language retrieval by proposing a structure-aware distillation method that preserves the geometric structure of representations, leading to improved cross-modal and single-modal retrieval performance.
Contribution
The paper introduces a structure-aware distillation approach that rebalances cross-modal matching by preserving the structure of representations, with the aim of improving retrieval accuracy.
Findings
Improved cross-modal retrieval performance across multiple datasets.
Enhanced single-modal retrieval capabilities compared to baseline models.
Demonstrated the effectiveness of structure-aware distillation in handling modal imbalance.
Abstract
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Cross-modal matching implicitly assumes modal balance, where each modality contains sufficient information to represent the others. In practice, however, noise interference and modality insufficiency often lead to modal imbalance, and its impact on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. Imbalanced modalities inherently affect the structure of instances in the common space, posing a challenge to cross-modal similarity measurement. To…
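As a rough illustration of what "preserving the geometric structure of representations" can mean, the sketch below compares the pairwise cosine-similarity matrix of a student embedding set with that of a teacher embedding set, and ranks gallery items by cosine similarity in a common space. This is not the paper's loss: the summary above does not specify it, so the mean-squared difference between similarity matrices is an assumed stand-in, and every function name here is hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def similarity_structure(emb):
    """Pairwise cosine similarities between all instances in one embedding set."""
    z = l2_normalize(emb)
    return z @ z.T


def structure_distillation_loss(student_emb, teacher_emb):
    """Assumed illustrative loss: mean squared difference between the student's
    and the teacher's pairwise-similarity matrices. The embedding dimensions of
    student and teacher may differ; only the number of instances must match."""
    s = similarity_structure(student_emb)
    t = similarity_structure(teacher_emb)
    return float(np.mean((s - t) ** 2))


def cross_modal_retrieve(query_emb, gallery_emb, k=1):
    """Indices of the top-k gallery items per query, ranked by cosine similarity."""
    scores = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return np.argsort(-scores, axis=1)[:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 16))   # e.g. single-modal features (hypothetical)
    student = rng.normal(size=(8, 32))   # e.g. common-space features (hypothetical)
    print("structure loss:", structure_distillation_loss(student, teacher))
    print("top-1 matches:", cross_modal_retrieve(student, student).ravel())
```

A loss of this kind depends only on relations between instances, so it is unchanged by any rotation of the embedding space; that property is what separates structure preservation from matching individual feature vectors directly.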
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
