TL;DR
This paper introduces HINT, a dual-path network that improves composed image retrieval (CIR) by encoding contextual information and amplifying the similarity differences between matching and non-matching samples, achieving state-of-the-art results on two CIR benchmarks.
Contribution
The paper proposes a novel dual-path compositional contextualized network (HINT) that effectively encodes context and amplifies similarity differences for improved CIR.
Findings
HINT achieves state-of-the-art results on two CIR benchmarks.
HINT effectively encodes contextual information to distinguish matching from non-matching samples.
HINT improves performance in complex CIR scenarios.
Abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. Given a multimodal query composed of a reference image and modification text, it aims to retrieve from a large-scale image database the target images that are consistent with the modification semantics. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: they neglect contextual information when discriminating matching samples. Addressing this limitation is difficult because of two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which performs contextualized encoding and amplifies the similarity differences between matching and non-matching samples, thus raising the upper bound of CIR performance…
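To make the retrieval setup in the abstract concrete, here is a minimal sketch of a generic CIR scoring loop. It is not HINT's architecture, which the abstract does not specify. The additive fusion in `compose_query`, the temperature softmax in `sharpen`, and all function names are illustrative assumptions; the embeddings are random stand-ins for the outputs of real image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def compose_query(ref_img_emb, mod_text_emb):
    # Hypothetical fusion: a plain sum of the two embeddings.
    # Real CIR models learn this composition; this is only a placeholder.
    return l2_normalize(ref_img_emb + mod_text_emb)

def rank_gallery(query_emb, gallery_embs):
    # Cosine similarity between the composed query and every candidate image.
    sims = l2_normalize(gallery_embs) @ query_emb
    order = np.argsort(-sims)  # candidate indices, best match first
    return order, sims

def sharpen(sims, temperature=0.1):
    # Generic temperature-scaled softmax: a lower temperature widens the
    # gap between the best match and the rest. This is a stand-in used to
    # illustrate "amplifying differences", not the mechanism in the paper.
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
dim = 8
ref = rng.normal(size=dim)           # stand-in reference image embedding
txt = rng.normal(size=dim)           # stand-in modification text embedding
gallery = rng.normal(size=(5, dim))  # stand-in candidate image embeddings
gallery[3] = ref + txt               # plant a target that matches the query

query = compose_query(ref, txt)
order, sims = rank_gallery(query, gallery)
probs = sharpen(sims)
print("ranking:", order)
print("similarities:", np.round(sims, 3))
```

Under these assumptions the planted target at index 3 ranks first, and lowering `temperature` in `sharpen` concentrates more probability mass on it relative to the non-matching candidates.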
