Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention
Wei Suo, Mengyang Sun, Peng Wang, Qi Wu

TL;DR
This paper introduces a proposal-free, one-stage model for referring expression comprehension that directly predicts regions from dense image grids using cross-attention, achieving state-of-the-art results efficiently.
Contribution
The proposed model eliminates anchor proposals and the hyper-parameters they require, enabling end-to-end region prediction from dense image grids with a cross-attention transformer.
Findings
Achieves state-of-the-art performance on four datasets.
Higher efficiency compared to previous methods.
Eliminates anchor proposal process.
Abstract
Referring Expression Comprehension (REC) has become one of the most important tasks in visual reasoning, since it is an essential step for many vision-and-language tasks such as visual question answering. However, it has not been widely adopted in downstream tasks for two reasons: 1) two-stage methods incur heavy computation cost and inevitable error accumulation, and 2) one-stage methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) model that is able to regress the region-of-interest from the image, based on a textual query, in an end-to-end manner. Instead of following the dominant anchor-proposal approach, we directly take the dense grid of an image as input for a cross-attention transformer that learns grid-word correspondences. The final bounding box is predicted directly from the…
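The grid-word correspondence idea in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which each image grid cell acts as a query over the words of the expression. This is not the authors' implementation: the function names, the toy feature vectors, and the omission of learned query/key/value projections, multiple heads, and the bounding-box regression step are all simplifying assumptions made here for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grid_word_cross_attention(grid_feats, word_feats):
    """Each grid cell (query) attends over all words (keys and values).

    grid_feats: list of G vectors of length d, one per image grid cell.
    word_feats: list of W vectors of length d, one per word in the query.
    Returns (attended, weights): G vectors of length d and a G x W weight matrix.
    Hypothetical helper; learned projections are omitted for brevity.
    """
    d = len(word_feats[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for g in grid_feats:
        # Scaled dot product between this grid cell and every word.
        scores = [scale * sum(gi * wi for gi, wi in zip(g, w)) for w in word_feats]
        attn = softmax(scores)
        # Weighted sum of word features: the language context for this cell.
        ctx = [sum(a * w[k] for a, w in zip(attn, word_feats)) for k in range(d)]
        weights.append(attn)
        attended.append(ctx)
    return attended, weights

# Toy input (made up): a 2x2 grid flattened to 4 cells, 3 words, feature size 2.
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
words = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attended, weights = grid_word_cross_attention(grid, words)
```

Each row of `weights` is a distribution over words for one grid cell, so cells whose features align with a particular word place most of their attention on it. A full model would stack such layers and feed the attended grid features to a prediction head.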
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
