TL;DR
This paper introduces INTENT, a noise-mitigation framework for composed image retrieval (CIR) that targets both cross-modal correspondence noise and modality-inherent noise, with the aim of making retrieval more robust on noisily annotated real-world datasets.
Contribution
The paper proposes a dual-component model, INTENT, combining visual invariance via FFT and discriminative learning to handle different noise types in CIR datasets.
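The summary does not say how INTENT applies the FFT, so the following is only a generic illustration of the kind of frequency-domain decomposition such a component could build on, not the authors' method. It splits a grayscale image into low- and high-frequency parts with NumPy; the function name `split_frequency_bands` and the `radius` cutoff are hypothetical choices made for this sketch.

```python
import numpy as np


def split_frequency_bands(image, radius):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    Hypothetical helper for illustration only; `radius` is the cutoff
    (in frequency bins) measured from the centre of the shifted spectrum.
    """
    # 2-D FFT, shifted so the zero-frequency (DC) term sits at the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    # Distance of every frequency bin from the centre of the spectrum.
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Invert each masked spectrum; the two parts sum back to the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    low, high = split_frequency_bands(img, radius=4)
    # The decomposition is lossless: low + high reconstructs the image.
    print(np.allclose(low + high, img))
```

In a sketch like this, the low-frequency part carries coarse layout and color while the high-frequency part carries edges and fine texture; which band a given method treats as signal and which as nuisance is a design decision the summary leaves unspecified.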
Findings
INTENT outperforms existing methods on benchmark datasets.
The approach effectively reduces the impact of annotation errors.
Experimental results show improved robustness and accuracy.
Abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm in which target images are retrieved from multimodal queries, each consisting of a reference image and a modification text. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. In real-world scenarios, however, the high cost of triplet annotation means that CIR datasets inevitably contain annotation errors, which produce incorrectly matched triplets. As a result, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained…