Recurrent Multimodal Interaction for Referring Image Segmentation
Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Alan Yuille

TL;DR
This paper introduces a convolutional multimodal LSTM that jointly models language and visual information for referring image segmentation, outperforming the baseline model on benchmark datasets by capturing sequential word-image interactions.
Contribution
The paper proposes a novel convolutional multimodal LSTM that effectively models sequential interactions between words and images for improved segmentation.
Findings
Outperforms baseline models on benchmark datasets
Provides analysis of word-image interaction mechanisms
Demonstrates more effective multimodal encoding
Abstract
In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.
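The word-by-word interaction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that each word embedding is tiled over all spatial locations, concatenated with per-location visual and spatial features, and passed through a standard LSTM cell whose weights are shared across locations (equivalent to a 1x1 convolution). The function names, the feature dimensions, and the input/forget/output/candidate gate ordering are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv_mlstm_step(x_t, h_prev, c_prev, weights, bias):
    """One LSTM step applied at every spatial location with shared weights.

    x_t:     (rows, cols, d_in)  per-location input for the current word
    h_prev:  (rows, cols, d_h)   previous hidden state
    c_prev:  (rows, cols, d_h)   previous cell state
    weights: (d_in + d_h, 4 * d_h), shared across locations (a 1x1 convolution)
    bias:    (4 * d_h,)
    """
    z = np.concatenate([x_t, h_prev], axis=-1) @ weights + bias
    # Assumed gate order: input, forget, output, candidate.
    i, f, o, g = np.split(z, 4, axis=-1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def run_conv_mlstm(words, visual, spatial, weights, bias, hidden_dim):
    """Feed a sentence one word at a time into the per-location LSTM.

    words:   (num_words, d_w)   word embeddings
    visual:  (rows, cols, d_v)  image feature map
    spatial: (rows, cols, d_s)  spatial coordinate features
    """
    rows, cols = visual.shape[:2]
    h = np.zeros((rows, cols, hidden_dim))
    c = np.zeros((rows, cols, hidden_dim))
    for w_t in words:
        # Tile the word embedding so every location sees the same word.
        tiled = np.broadcast_to(w_t, (rows, cols, w_t.shape[0]))
        x_t = np.concatenate([tiled, visual, spatial], axis=-1)
        h, c = conv_mlstm_step(x_t, h, c, weights, bias)
    return h  # final multimodal state at each location


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_words, rows, cols = 5, 4, 6
    d_w, d_v, d_s, d_h = 8, 16, 8, 10  # illustrative sizes only
    words = rng.normal(size=(num_words, d_w))
    visual = rng.normal(size=(rows, cols, d_v))
    spatial = rng.normal(size=(rows, cols, d_s))
    weights = rng.normal(scale=0.1, size=(d_w + d_v + d_s + d_h, 4 * d_h))
    bias = np.zeros(4 * d_h)
    h = run_conv_mlstm(words, visual, spatial, weights, bias, d_h)
    print(h.shape)  # (4, 6, 10)
```

In this sketch the hidden state is a feature map rather than a single vector, so each location accumulates its own evidence as the sentence is read. A segmentation model would still need a further layer mapping the final state to a per-pixel mask, which is omitted here.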
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
