You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding
Chaorui Deng, Qi Wu, Guanghui Xu, Zhuliang Yu, Yanwu Xu, Kui Jia, Mingkui Tan

TL;DR
This paper introduces a fast, one-stage visual grounding method that directly predicts relevant image regions based on natural language queries, significantly improving speed and accuracy over traditional two-stage approaches.
Contribution
The paper presents a novel one-stage detection network with Relation-to-Attention modules that jointly perform region proposal and matching, enhancing efficiency and performance.
Findings
20x to 30x faster inference than previous methods
Achieves 18% to 41% absolute performance improvement
Operates effectively without large numbers of region proposals
Abstract
Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, so it can be a more useful technique than object detection in practice. Most state-of-the-art methods in VG operate in a two-stage manner: in the first stage, an object detector is adopted to generate a set of object proposals from the input image, and the second stage is simply formulated as a cross-modal matching problem that finds the best match between the language query and all region proposals. This is rather inefficient because there might be hundreds of proposals produced in the first stage that need to be compared in the second stage, not to mention that this strategy performs inaccurately. In this paper, we propose a simple, intuitive and much more elegant one-stage detection based method that joins the region proposal and…
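To make the cost of the two-stage baseline concrete, the following is a minimal sketch of its second stage: scoring one query embedding against every region proposal and keeping the best match. It is not the paper's method. The function names, the toy feature vectors, and the use of cosine similarity as the matching score are all assumptions made for illustration; the abstract only says that the query is compared with all proposals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_query_to_proposals(query_vec, proposals):
    """Return the (box, score) of the proposal that best matches the query.

    `proposals` is a list of (box, feature_vec) pairs, a hypothetical
    stand-in for the output of a first-stage object detector.
    """
    best_box, best_score = None, float("-inf")
    # One comparison per proposal: cost grows linearly with their number.
    for box, feat in proposals:
        score = cosine(query_vec, feat)
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score


# Toy data: three proposals as (x1, y1, x2, y2) boxes with 3-d features.
proposals = [
    ((10, 20, 50, 80), [1.0, 0.0, 0.0]),
    ((30, 40, 90, 120), [0.0, 1.0, 0.0]),
    ((5, 5, 60, 60), [0.6, 0.8, 0.0]),
]
query_vec = [0.0, 1.0, 0.0]  # hypothetical embedding of the language query

best_box, best_score = match_query_to_proposals(query_vec, proposals)
print(best_box, best_score)
```

With hundreds of proposals per image, this loop runs hundreds of times per query, which is the inefficiency the abstract attributes to two-stage methods.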
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
