Dynamic Weighted Combiner for Mixed-Modal Image Retrieval
Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song

TL;DR
This paper introduces a Dynamic Weighted Combiner (DWC) for mixed-modal image retrieval. It targets two problems, the unequal contributions of the image and text modalities and labeling noise in web datasets, and reports improved retrieval performance on real-world datasets.
Contribution
The paper presents a novel DWC framework with an Editable Modality De-equalizer, a dynamic soft-similarity label generator, and a CLIP-based mutual enhancement module, advancing mixed-modal retrieval methods.
Findings
Outperforms state-of-the-art methods on real-world datasets
Effectively handles modality contribution disparities
Reduces impact of labeling noise in web datasets
Abstract
Mixed-Modal Image Retrieval (MMIR), as a flexible search paradigm, has attracted wide attention. However, previous approaches achieve limited performance because two critical factors are seriously overlooked. 1) The image and text modalities contribute differently, but are incorrectly treated as equal. 2) Web datasets from diverse real-world scenarios contain inherent labeling noise in the text that describes users' intentions, which gives rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle these challenges, which has three merits. First, we propose an Editable Modality De-equalizer (EMD) that takes into account the contribution disparity between modalities and contains two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noise and data bias, we propose a dynamic soft-similarity label generator (SSG) to…
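To make the idea of weighting modalities unequally concrete, here is a minimal sketch of a weighted feature combination. It is not the paper's EMD: the abstract does not specify how the weights are computed, so the function names `softmax` and `weighted_combine` and the `gate_logits` input are assumptions made for illustration. In a trained model the logits would presumably be predicted from the features themselves.

```python
import math


def softmax(logits):
    """Normalize a list of scores into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_combine(image_feat, text_feat, gate_logits):
    """Fuse two same-length feature vectors with per-modality weights.

    gate_logits is a hypothetical pair of scores (image, text); it stands in
    for whatever a learned gating network would output.
    """
    if len(image_feat) != len(text_feat):
        raise ValueError("image and text features must have the same length")
    w_img, w_txt = softmax(gate_logits)
    return [w_img * i + w_txt * t for i, t in zip(image_feat, text_feat)]


# Equal logits reduce to plain averaging, the "equal treatment" baseline.
equal = weighted_combine([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])

# A larger image logit shifts the fused vector toward the image feature.
image_heavy = weighted_combine([1.0, 0.0], [0.0, 1.0], [2.0, 0.0])
```

With equal logits the fusion is a simple average; any other logits let one modality dominate, which is the kind of disparity the abstract argues should be modeled.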
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
