Real-Time Referring Expression Comprehension by Single-Stage Grounding Network
Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, Jiebo Luo

TL;DR
This paper introduces a single-stage, end-to-end model called SSG for real-time referring expression comprehension, achieving high accuracy and efficiency without relying on region proposals.
Contribution
SSG removes both region proposals and the region-wise feature extraction that follows them, and adds a guided attention mechanism and attribute prediction to improve localization.
Findings
Achieves state-of-the-art performance on the ReferItGame dataset.
Runs in 25 ms per image, significantly faster than previous models.
Performs comparably or better than multi-stage models on benchmark datasets.
Abstract
In this paper, we propose a novel end-to-end model, namely the Single-Stage Grounding network (SSG), to localize the referent given a referring expression within an image. Unlike previous multi-stage models, which rely on object proposals or detected regions, our proposed model aims to comprehend a referring expression in a single stage, without resorting to region proposals or the subsequent region-wise feature extraction. Specifically, a multimodal interactor is proposed to attentively summarize the local region features with respect to the referring expression. Subsequently, a grounder is proposed to localize the referring expression within the given image directly. To further improve the localization accuracy, a guided attention mechanism is proposed to force the grounder to focus on the central region of the referent. Moreover, by exploiting and predicting visual…
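The attentive summarization and guided attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the scaled dot-product scoring, the cross-entropy form of the guidance loss, and the names `attend`, `center_cell`, and `guided_attention_loss` are all assumptions made for illustration. It shows how an expression vector could weight grid-cell features, and how a loss could push that weight toward the cell containing the center of the referent's box.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_feats, text_vec):
    """Weight each grid-cell feature by its similarity to the expression vector.

    Assumption: scaled dot-product scoring; the paper's interactor may differ.
    Returns (attention weights, attention-weighted summary feature).
    """
    scale = math.sqrt(len(text_vec))
    scores = [sum(f * t for f, t in zip(feat, text_vec)) / scale
              for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    summary = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
               for d in range(dim)]
    return weights, summary


def center_cell(box, grid_w, grid_h, img_w, img_h):
    """Row-major index of the grid cell containing the center of box (x1, y1, x2, y2)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / img_w * grid_w), grid_w - 1)
    row = min(int(cy / img_h * grid_h), grid_h - 1)
    return row * grid_w + col


def guided_attention_loss(weights, target_idx):
    """Assumed guidance term: cross-entropy against the center cell."""
    return -math.log(weights[target_idx])
```

With two cells whose features are `[1.0, 0.0]` and `[0.0, 1.0]` and an expression vector `[2.0, 0.0]`, `attend` puts more weight on the first cell, so the guidance loss is lower when the referent's center falls in that cell than when it falls in the other.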
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
