IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han

TL;DR
IMRAM is an iterative, memory-augmented method for fine-grained cross-modal image-text retrieval. It refines image-text alignments over multiple steps and reports state-of-the-art results on Flickr8K, Flickr30K, and MS COCO.
Contribution
The paper proposes IMRAM, a novel iterative matching framework with recurrent attention memory that better captures complex semantic correspondences between images and texts.
Findings
Achieves state-of-the-art performance on Flickr8K, Flickr30K, and MS COCO datasets.
Effectively refines alignments through multiple iterative steps.
Demonstrates practical applicability on a business advertisement dataset.
Abstract
Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult to optimally capture such sophisticated correspondences in existing methods. In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignments. Specifically, we introduce an…
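The abstract is cut off before it describes the components of the method, so the following is only a minimal sketch of the general idea of matching over multiple alignment steps, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`cosine`, `attend`, `iterative_match`), the softmax temperature, the fixed `gate` value standing in for whatever learned memory update the paper uses, and the choice to sum per-step similarities into one score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def attend(query, keys, temperature=0.1):
    """Softmax-weighted average of `keys`, weighted by cosine similarity to `query`."""
    sims = [cosine(query, k) / temperature for k in keys]
    m = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in sims]
    z = sum(weights)
    dim = len(keys[0])
    return [sum(w / z * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def iterative_match(words, regions, steps=3, gate=0.5):
    """Accumulate a text-to-image matching score over several alignment steps.

    Assumption: a fixed scalar `gate` blends each query with its attended
    context; a real model would learn this update.
    """
    queries = [list(w) for w in words]
    total = 0.0
    for _ in range(steps):
        # Align every word query with the image regions.
        contexts = [attend(q, regions) for q in queries]
        # Score this step as the mean query/context similarity.
        total += sum(cosine(q, c) for q, c in zip(queries, contexts)) / len(queries)
        # Memory update: carry part of the aligned context into the next step.
        queries = [
            [gate * qi + (1 - gate) * ci for qi, ci in zip(q, c)]
            for q, c in zip(queries, contexts)
        ]
    return total


if __name__ == "__main__":
    words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    matching_regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    unrelated_regions = [[0.0, 0.0, 1.0]]
    print(iterative_match(words, matching_regions))
    print(iterative_match(words, unrelated_regions))
```

In this toy setup, the pair whose regions line up with the words receives the higher accumulated score. With a fixed gate the queries drift toward whatever context they attend to, which is one reason a learned update would be preferable in practice.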
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
