Referencing Where to Focus: Improving Visual Grounding with Referential Query
Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang

TL;DR
This paper introduces RefFormer, a novel visual grounding method that improves target localization by using referential queries generated through a query adaption module, enhancing learning efficiency and accuracy.
Contribution
The paper proposes a query adaption module integrated with CLIP to generate referential queries, reducing learning difficulty and leveraging multi-level image features for better grounding performance.
Findings
Outperforms state-of-the-art on five benchmarks.
Effectively mitigates learning difficulty in query generation.
Preserves CLIP knowledge without backbone tuning.
Abstract
Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods use only the deepest image feature during the query learning process, overlooking the importance of features from other levels. To…
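The contrast the abstract draws between vanilla query initialization and queries that already carry target-related information can be illustrated with a small sketch. This is not RefFormer's implementation: the function names (`random_queries`, `linguistic_queries`, `referential_queries`) are hypothetical, the features are random placeholders rather than CLIP outputs, and the alternating text/image cross-attention over several feature levels is an assumption made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head dot-product cross-attention with a residual connection."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def random_queries(num_queries, dim, rng):
    """Vanilla option 1: queries start from random values, with no target cue."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]


def linguistic_queries(text_tokens, num_queries):
    """Vanilla option 2: queries start from the mean-pooled text embedding."""
    dim = len(text_tokens[0])
    pooled = [sum(t[i] for t in text_tokens) / len(text_tokens) for i in range(dim)]
    return [list(pooled) for _ in range(num_queries)]


def referential_queries(init_queries, text_tokens, image_levels):
    """Assumed scheme: refine queries level by level with text and image features."""
    queries = init_queries
    for level_features in image_levels:
        queries = cross_attend(queries, text_tokens)      # inject the expression
        queries = cross_attend(queries, level_features)   # inject this image level
    return queries


rng = random.Random(0)
DIM = 8
text_tokens = random_queries(5, DIM, rng)                        # placeholder text features
image_levels = [random_queries(12, DIM, rng) for _ in range(3)]  # placeholder multi-level image features

vanilla = random_queries(4, DIM, rng)
from_text = linguistic_queries(text_tokens, 4)
referential = referential_queries(vanilla, text_tokens, image_levels)
```

In this toy setting, `vanilla` and `from_text` reach the decoder without having seen the image, whereas `referential` has already attended to the expression and to every image feature level, which is the kind of prior the abstract argues vanilla queries lack.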
Taxonomy
Topics: Data Visualization and Analytics · Virtual Reality Applications and Impacts · Visual and Cognitive Learning Processes
Methods: Contrastive Language-Image Pre-training
