Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Zhihong Chen; Ruifei Zhang; Yibing Song; Xiang Wan; Guanbin Li

arXiv:2307.11558 · cs.CV · July 24, 2023 · 1 citation

TL;DR

This paper introduces SK-VG, a scene knowledge-guided visual grounding benchmark that requires models to reason over scene knowledge, highlighting the need for improved interpretability and performance in vision-language understanding.

Contribution

The paper proposes a novel benchmark, SK-VG, that requires reasoning over scene knowledge, and introduces two methods for integrating that knowledge into visual grounding models.

Findings

01 The proposed approaches achieve promising results on SK-VG.

02 Models still have significant room for improvement.

03 The benchmark emphasizes the importance of scene knowledge reasoning in visual grounding (VG).

Abstract

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study [luo2022goes], where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning…

Code & Models

Repositories

zhjohnchan/sk-vg (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning