Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention

Wei Suo; Mengyang Sun; Peng Wang; Qi Wu

arXiv:2105.02061·cs.CV·May 6, 2021

Open Access

TL;DR

This paper introduces a proposal-free, one-stage model for referring expression comprehension that directly predicts regions from dense image grids using cross-attention, achieving state-of-the-art results efficiently.

Contribution

The proposed model removes anchor proposals and the hyper-parameters tied to them, so a cross-attention transformer can predict the region end-to-end from dense image grids.

Findings

01. Achieves state-of-the-art performance on four datasets.

02. Higher efficiency compared to previous methods.

03. Eliminates anchor proposal process.

Abstract

Referring Expression Comprehension (REC) has become one of the most important tasks in visual reasoning, since it is an essential step for many vision-and-language tasks such as visual question answering. However, it has not been widely used in downstream tasks for two reasons: 1) two-stage methods incur heavy computation cost and inevitable error accumulation, and 2) one-stage methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) model that is able to regress the region-of-interest from the image, based on a textual query, in an end-to-end manner. Instead of using the dominant anchor proposal fashion, we directly take the dense grid of an image as input for a cross-attention transformer that learns grid-word correspondences. The final bounding box is predicted directly from the…
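To make the grid-word idea concrete, here is a minimal sketch of cross-attention in which each image grid cell (query) attends over the words of the expression (keys and values). It is not the paper's implementation: the function names, the single attention head, the absence of learned projections, and the toy 2-dimensional features are all assumptions chosen for illustration, and the box prediction step is left out because the abstract above is truncated before describing it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grid_word_cross_attention(grid_feats, word_feats):
    """Let every grid cell attend over all words (hypothetical helper).

    grid_feats: list of G vectors (queries), one per image grid cell.
    word_feats: list of W vectors (keys and values), one per word.
    Returns (attended, weights): G fused vectors and a G x W weight matrix.
    """
    dim = len(word_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for g in grid_feats:
        # Scaled dot-product score between this grid cell and each word.
        scores = [sum(a * b for a, b in zip(g, w)) / scale for w in word_feats]
        attn = softmax(scores)
        # Weighted sum of word vectors: the language context for this cell.
        fused = [sum(p * w[k] for p, w in zip(attn, word_feats)) for k in range(dim)]
        weights.append(attn)
        attended.append(fused)
    return attended, weights


# Toy inputs (assumed): a 2x2 grid flattened to 4 cells, and a 3-word expression.
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
words = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, attn = grid_word_cross_attention(grid, words)
```

Each row of `attn` is a distribution over words for one grid cell, which is one simple way to read "grid-word correspondences". A real model would typically add learned query, key, and value projections, multiple heads, and stacked layers.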

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning