Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

TL;DR
This paper introduces SK-VG, a scene knowledge-guided visual grounding benchmark that requires models to reason over scene knowledge, highlighting the need for improved interpretability and performance in vision-language understanding.
Contribution
The paper proposes a novel benchmark, SK-VG, that requires reasoning over scene knowledge, and introduces two methods for integrating that knowledge into visual grounding models.
Findings
The proposed approaches achieve promising results on SK-VG
Performance still leaves significant room for improvement
The benchmark emphasizes the importance of scene knowledge reasoning in visual grounding (VG)
Abstract
Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study \cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
