Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation

Yang Yang; Wenjuan Xi; Luping Zhou; Jinhui Tang

arXiv:2412.10761 · cs.CV · December 17, 2024

TL;DR

This paper addresses the challenge of modal imbalance in vision-language retrieval by proposing a structure-aware distillation method that preserves the geometric structure of representations, leading to improved cross-modal and single-modal retrieval performance.

Contribution

The paper introduces a structure-aware distillation approach that rebalances cross-modal matching, preserving the structure of instances in the common space to improve retrieval accuracy.
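One way to picture "structure preservation" is as a penalty on the mismatch between the pairwise similarity matrices of two sets of embeddings. The sketch below works under that assumption only; it is not the paper's loss, and the names `similarity_matrix` and `structure_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all instances in one space."""
    unit = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def structure_loss(teacher, student):
    """Mean squared difference between two pairwise-similarity matrices.

    Assumption: "structure" is approximated by cosine similarities between
    instances, so the loss is zero whenever the student keeps the same
    relative geometry as the teacher (for example after rotation or scaling).
    """
    s_t = similarity_matrix(teacher)
    s_s = similarity_matrix(student)
    n = len(teacher)
    total = sum((s_t[i][j] - s_s[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0]]    # same geometry, rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # both instances mapped to one point

print(structure_loss(teacher, rotated))
print(structure_loss(teacher, collapsed))
```

A loss of this kind ignores where individual embeddings sit and only compares how instances relate to each other, which is why the rotated copy is not penalized while the collapsed one is.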

Findings

01 Improved cross-modal retrieval performance across multiple datasets.

02 Enhanced single-modal retrieval capabilities compared to baseline models.

03 Demonstrated the effectiveness of structure-aware distillation in handling modal imbalance.

Abstract

Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Cross-modal matching rests on the assumption of modal balance, in which each modality contains sufficient information to represent the others. In practice, however, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. Imbalanced modalities inherently affect the structure of instances in the common space, which poses a challenge to cross-modal similarity measurement. To…
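To make the retrieval setting concrete, the following minimal sketch ranks items of one modality against a query from the other by cosine similarity in a shared space. The embeddings are toy values and `retrieve` is a hypothetical helper; the paper's encoders and similarity measure may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors in the common space."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery items most similar to the query.

    Ties are broken by the lower index so the ranking is deterministic.
    """
    scored = [(cosine(query, item), idx) for idx, item in enumerate(gallery)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy example: one text query embedding against three image embeddings.
text_query = [1.0, 0.1]
image_gallery = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]

print(retrieve(text_query, image_gallery, k=2))
```

Because the ranking depends entirely on distances in the common space, any distortion of that space caused by an uninformative or noisy modality directly changes which items are returned.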

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques