Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching
Tianlang Chen, Jiebo Luo

TL;DR
This paper introduces a Dual Path Recurrent Neural Network (DP-RNN) that reorders image objects by the positions of their most related words in the text, then uses the resulting high-level object features together with attention mechanisms to improve image-text matching accuracy.
Contribution
The paper proposes a novel DP-RNN model that processes images and texts symmetrically, capturing semantic relations between objects for better matching performance.
Findings
Achieves state-of-the-art results on the Flickr30K dataset.
Demonstrates competitive performance on the MS-COCO dataset.
Validates the effectiveness of high-level object features and attention mechanisms.
Abstract
Existing image-text matching approaches typically infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. However, they ignore the connections between the objects that are semantically related. These objects may collectively determine whether the image corresponds to a text or not. To address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN). In particular, given an input image-text pair, our model reorders the image objects based on the positions of their most related words in the text. In the same way as extracting the hidden features from word embeddings, the model leverages RNN to extract high-level object features from the reordered object inputs. We validate that the high-level object…
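The reordering step described in the abstract can be illustrated with a minimal sketch. It assumes that object features and word embeddings already live in a shared space and that "most related" means highest cosine similarity. The function name, the similarity measure, and the stable tie-breaking rule are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def reorder_objects_by_word_position(object_feats, word_embs):
    """Sort objects by the position of their most similar word.

    Hypothetical helper: cosine similarity and stable tie-breaking
    are illustrative assumptions, not the paper's exact procedure.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    obj = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    wrd = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    sim = obj @ wrd.T                      # shape: (n_objects, n_words)
    best_word = sim.argmax(axis=1)         # word position matched to each object
    # Stable sort keeps the original order among objects matched to the same word.
    order = np.argsort(best_word, kind="stable")
    return order, object_feats[order]

# Toy data: three words and three objects in a 3-dimensional space.
word_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
object_feats = np.array([[0.1, 0.2, 0.9],   # closest to word 2
                         [0.8, 0.1, 0.1],   # closest to word 0
                         [0.2, 0.7, 0.1]])  # closest to word 1
order, reordered = reorder_objects_by_word_position(object_feats, word_embs)
```

In this toy case the objects come out in the order of the words they match, so the reordered sequence could then be fed to an RNN in the same way as a sequence of word embeddings.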
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
