Referencing Where to Focus: Improving Visual Grounding with Referential Query

Yabing Wang; Zhuotao Tian; Qingpei Guo; Zheng Qin; Sanping Zhou; Ming Yang; Le Wang

arXiv:2412.19155·cs.CV·December 30, 2024

TL;DR

This paper introduces RefFormer, a novel visual grounding method that improves target localization by using referential queries generated through a query adaption module, enhancing learning efficiency and accuracy.

Contribution

The paper proposes a query adaption module integrated with CLIP to generate referential queries, reducing learning difficulty and leveraging multi-level image features for better grounding performance.

Findings

01. Outperforms state-of-the-art on five benchmarks.

02. Effectively mitigates learning difficulty in query generation.

03. Preserves CLIP knowledge without backbone tuning.

Abstract

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods use only the deepest image feature during the query learning process, overlooking the importance of features from other levels. To…
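
The two vanilla query-generation strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the 0.02 initialization scale, and the choice of masked mean pooling over token embeddings are all assumptions made for illustration. The point it shows is that neither strategy looks at the image, so the decoder starts without any target-related information.

```python
import numpy as np


def random_init_queries(num_queries, dim, seed=0):
    """Hypothetical helper: queries drawn from a small Gaussian.

    In a real model these would be trainable parameters; here they are
    just an array. The 0.02 scale is an assumed value.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 0.02, size=(num_queries, dim))


def linguistic_queries(token_embeddings, attention_mask, num_queries):
    """Hypothetical helper: queries built from the expression only.

    token_embeddings: (num_tokens, dim) text features.
    attention_mask:   (num_tokens,) 1 for real tokens, 0 for padding.
    Masked mean pooling is an assumed choice of sentence embedding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    # Average only over real tokens; guard against an all-padding input.
    pooled = (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1.0)
    # Every query starts from the same sentence-level vector.
    return np.tile(pooled, (num_queries, 1))


# Toy inputs: three tokens, the last one is padding.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])

q_random = random_init_queries(num_queries=4, dim=2)
q_text = linguistic_queries(tokens, mask, num_queries=4)
```

Both arrays have shape `(num_queries, dim)` and could be passed to a decoder in the same way; neither depends on image features.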

Taxonomy

Topics: Data Visualization and Analytics · Virtual Reality Applications and Impacts · Visual and Cognitive Learning Processes

Methods: Contrastive Language-Image Pre-training