TL;DR
MELT is a network for composed image retrieval that directs more attention to rare modification semantics and applies diffusion-based denoising to hard negatives with high similarity scores, leading to better retrieval accuracy.
Contribution
The paper introduces MELT, a network that addresses frequency bias and similarity interference in CIR through semantic localization and diffusion-based denoising.
Findings
MELT outperforms existing methods on two CIR benchmarks.
The approach effectively localizes rare semantic modifications.
Diffusion-based denoising improves robustness against hard negatives.
Abstract
Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and…
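To make the retrieval setup in the abstract concrete, here is a minimal sketch of generic CIR scoring, not MELT itself. It assumes toy 3-dimensional embeddings, a hypothetical sum-then-normalize fusion (`compose_query`), and cosine-similarity ranking; all names and numbers are illustrative and do not come from the paper. It shows why a hard negative is a problem: its score sits very close to the target's, so a small amount of noise could flip the ranking.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def compose_query(ref_img_emb, mod_text_emb):
    # Hypothetical fusion: element-wise sum, then normalize.
    # This is a stand-in for illustration, not MELT's fusion module.
    return l2_normalize([a + b for a, b in zip(ref_img_emb, mod_text_emb)])

def rank_candidates(query, gallery):
    """Rank gallery items by cosine similarity to the composed query."""
    scores = {
        name: sum(q * c for q, c in zip(query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy embeddings (assumed values, for illustration only).
ref_img = [1.0, 0.0, 0.0]    # reference image
mod_text = [0.0, 1.0, 0.0]   # modification text
gallery = {
    "target": [1.0, 1.0, 0.0],         # satisfies the modification
    "hard_negative": [1.0, 0.8, 0.1],  # nearly satisfies it
    "easy_negative": [0.0, 0.0, 1.0],  # unrelated image
}

query = compose_query(ref_img, mod_text)
ranking = rank_candidates(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In this toy gallery the target ranks first, but the hard negative trails it by less than 0.01 in cosine similarity, which is the kind of narrow margin that robust similarity estimation is meant to protect.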
