Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval
Prateksha Udhayanan, Srikrishna Karanam, and Balaji Vasan Srinivasan

TL;DR
This paper introduces a multi-modal gradient attention mechanism for composed image retrieval. By explicitly guiding the model to localize the image regions that the modification text refers to, it improves local feature learning, explainability, and retrieval accuracy.
Contribution
The paper proposes a new gradient-attention-based learning objective and visual attention computation technique called MMGrad, enhancing local feature learning and explainability in composed image retrieval.
Findings
Improved localization of modified regions in images.
Enhanced explainability through better attention maps.
Competitive retrieval performance on benchmark datasets.
Abstract
We consider the problem of composed image retrieval that takes an input query consisting of an image and a modification text indicating the desired changes to be made on the image and retrieves images that match these changes. Current state-of-the-art techniques that address this problem use global features for the retrieval, resulting in incorrect localization of the regions of interest to be modified because of the global nature of the features, more so in cases of real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features to be able to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by…
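As a rough illustration of what "gradient-based attention" means in general, the sketch below computes a Grad-CAM-style attention map for a similarity score between an image feature map and a text embedding. It is not the paper's MMGrad formulation, which the truncated abstract does not spell out. The function name `gradient_attention`, the mean-pooled dot-product score, and the toy shapes are all assumptions made for this example.

```python
import numpy as np

def gradient_attention(feature_map, text_embedding):
    """Grad-CAM-style attention for s = <mean-pooled features, text embedding>.

    Hypothetical helper for illustration only; not the paper's method.
    feature_map: array of shape (C, H, W); text_embedding: array of shape (C,).
    """
    channels, height, width = feature_map.shape
    # For a mean-pooled dot-product score, ds/dA[c, h, w] = text_embedding[c] / (H * W).
    grads = np.broadcast_to(
        text_embedding[:, None, None] / (height * width), feature_map.shape
    )
    # Channel weights: spatial average of the gradients, as in Grad-CAM.
    weights = grads.mean(axis=(1, 2))
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    attention = np.maximum(np.tensordot(weights, feature_map, axes=1), 0.0)
    peak = attention.max()
    return attention / peak if peak > 0 else attention

# Toy example: plant a text-aligned local feature at row 2, column 1.
rng = np.random.default_rng(0)
features = 0.01 * rng.standard_normal((8, 4, 4))
text = rng.standard_normal(8)
features[:, 2, 1] += text
attn = gradient_attention(features, text)
peak_location = tuple(int(i) for i in np.unravel_index(attn.argmax(), attn.shape))
print(peak_location)
```

In this toy setup the map peaks where the planted feature sits, which is the kind of local evidence a gradient-attention objective could be trained to sharpen. How the paper turns such maps into a training loss is not stated in the text available here.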
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Focus
