Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Menghao Li; Chunlei Wang; Wenquan Feng; Shuchang Lyu; Guangliang Cheng; Xiangtai Li; Binghao Liu; Qi Zhao

arXiv:2307.12392 · cs.CV · July 25, 2023

TL;DR

This paper introduces IR-VG, a novel framework for visual grounding that enhances localization accuracy and robustness against inaccurate descriptions through iterative fusion, masked supervision, and false-alarm prevention, achieving state-of-the-art results.

Contribution

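The paper proposes the IR-VG framework, built from three components: Masked Reference based Centerpoint Supervision (MRCS), iterative multi-level vision-language fusion (IMVF), and multi-stage false-alarm sensitive decoding (MFSD). Together they improve the robustness and localization precision of visual grounding.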

Findings

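1. Achieves new state-of-the-art results on robust VG datasets, with improvements of 25% and 10% over existing SOTA approaches.
2. Demonstrates effectiveness on five regular VG datasets, showing that the method is not limited to the robust setting.
3. Introduces multi-stage false-alarm sensitive decoding (MFSD) to reduce false alarms and improve localization accuracy.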

Abstract

Visual Grounding (VG) aims to localize target objects in an image based on given expressions and has made significant progress with the development of detection models and vision transformers. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, achieve accurate localization, and sufficiently comprehend context from the whole image and the textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to achieve more accurate localization with point-wise feature supervision. Then, to improve the…
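
The abstract does not specify how the point-wise centerpoint supervision is computed. As a rough illustration only, the sketch below assumes a CenterNet-style formulation: the ground-truth box becomes a Gaussian heatmap that peaks at its center cell, a predicted heatmap is scored with a penalty-reduced focal loss, and the center is decoded with an argmax. The function names and the parameters `sigma_scale`, `alpha`, and `beta` are hypothetical choices, and the "masked reference" part of MRCS is not modeled at all. This is not the IR-VG implementation; the official repository listed below is the authoritative source.

```python
import numpy as np


def gaussian_center_heatmap(box, height, width, sigma_scale=6.0):
    """Target heatmap that peaks (value 1.0) at the center of box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    # Integer center cell, clipped to the map; this cell is the single positive
    cx = int(np.clip(np.floor((x1 + x2) / 2.0), 0, width - 1))
    cy = int(np.clip(np.floor((y1 + y2) / 2.0), 0, height - 1))
    # Hypothetical heuristic: Gaussian spread grows with box size, at least 1 pixel
    sigma = max(max(x2 - x1, y2 - y1) / sigma_scale, 1.0)
    ys, xs = np.mgrid[0:height, 0:width]  # row and column index grids
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss (CenterNet style) for a heatmap with values in (0, 1)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0  # exact center cell
    neg = ~pos           # all other cells, down-weighted near the center by (1 - target)**beta
    pos_loss = -np.sum((1.0 - pred[pos]) ** alpha * np.log(pred[pos]))
    neg_loss = -np.sum(
        (1.0 - target[neg]) ** beta * pred[neg] ** alpha * np.log(1.0 - pred[neg])
    )
    # Normalize by the number of positive cells
    return float((pos_loss + neg_loss) / max(int(pos.sum()), 1))


def decode_center(pred):
    """Return (x, y) of the highest-scoring cell in the heatmap."""
    y, x = np.unravel_index(np.argmax(pred), pred.shape)
    return int(x), int(y)


if __name__ == "__main__":
    target = gaussian_center_heatmap((20, 30, 60, 50), height=64, width=96)
    print(decode_center(target))  # (40, 40)
```

Supervising a single center cell, instead of regressing box corners directly, gives the localization signal a precise point-wise target; the Gaussian falloff keeps cells near the center from being penalized as hard negatives.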

Code & Models

Repositories

cv516buaa/ir-vg
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning