You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Chaorui Deng; Qi Wu; Guanghui Xu; Zhuliang Yu; Yanwu Xu; Kui Jia; Mingkui Tan

arXiv:1902.04213 · cs.CV · March 19, 2019 · 1 cites

TL;DR

This paper introduces a fast, one-stage visual grounding method that directly predicts relevant image regions based on natural language queries, significantly improving speed and accuracy over traditional two-stage approaches.

Contribution

The paper presents a novel one-stage detection network with Relation-to-Attention modules that jointly perform region proposal and matching, enhancing efficiency and performance.

Findings

01. 20x to 30x faster inference than previous methods

02. Achieves 18% to 41% absolute performance improvement

03. Operates effectively without large numbers of region proposals

Abstract

Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, so it can be more useful than object detection in practice. Most state-of-the-art VG methods operate in two stages: in the first stage, an object detector generates a set of object proposals from the input image; the second stage is formulated as a cross-modal matching problem that finds the best match between the language query and all region proposals. This is rather inefficient, because the first stage may produce hundreds of proposals that must each be compared in the second stage, and the strategy is also inaccurate. In this paper, we propose a simple, intuitive and much more elegant one-stage detection based method that joins the region proposal and…
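To make the cost of the two-stage baseline concrete, here is a minimal sketch of the second-stage matching step the abstract describes. It is illustrative only and is not the paper's method: the cosine-similarity scoring, the function names (`cosine`, `ground_two_stage`), and the toy feature vectors are all assumptions. In a real system the query and proposal features would come from learned encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ground_two_stage(query_vec, proposals):
    """Return (box, score) for the proposal that best matches the query.

    proposals is a list of (box, feature_vec) pairs, standing in for the
    output of a first-stage object detector (hypothetical data layout).
    Every proposal is scored against the query, so the work grows
    linearly with the number of proposals.
    """
    if not proposals:
        raise ValueError("need at least one proposal")
    best_box, best_score = None, -math.inf
    for box, feat in proposals:
        score = cosine(query_vec, feat)
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score


# Toy example: one query embedding and three region proposals.
query = [1.0, 0.0, 1.0]
proposals = [
    ((10, 10, 50, 50), [0.0, 1.0, 0.0]),
    ((30, 20, 90, 80), [0.9, 0.1, 1.1]),
    ((0, 0, 20, 20), [1.0, 1.0, 0.0]),
]
box, score = ground_two_stage(query, proposals)
```

With hundreds of proposals per image, this loop (plus the detector pass that produced the proposals) is the overhead that a one-stage design aims to avoid.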

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning