Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Menghao Li, Chunlei Wang, Wenquan Feng, Shuchang Lyu, Guangliang Cheng, Xiangtai Li, Binghao Liu, Qi Zhao

TL;DR
This paper introduces IR-VG, a novel framework for visual grounding that enhances localization accuracy and robustness against inaccurate descriptions through iterative fusion, masked supervision, and false-alarm prevention, achieving state-of-the-art results.
Contribution
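The paper proposes the IR-VG framework, which combines Masked Reference based Centerpoint Supervision (MRCS), iterative multi-level vision-language fusion (IMVF), and multi-stage false-alarm sensitive decoding (MFSD) to improve robustness and precision in visual grounding tasks.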
Findings
Achieves new state-of-the-art (SOTA) results on robust VG datasets, with improvements of 25% and 10%.
Effective on five regular VG datasets, demonstrating versatility.
Introduces multi-stage false-alarm sensitive decoding for better accuracy.
Abstract
Visual Grounding (VG) aims at localizing target objects from an image based on given expressions and has made significant progress with the development of detection and vision transformers. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, accurate localization, and sufficient context comprehension from the whole image and textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to achieve more accurate localization with point-wise feature supervision. Then, to improve the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
