ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Minghang Zheng; Jiahua Zhang; Qingchao Chen; Yuxin Peng; Yang Liu

arXiv:2408.16314 · cs.CV · August 30, 2024

TL;DR

ResVG is a novel model that improves visual grounding accuracy in complex scenes with multiple similar objects by integrating semantic priors and relation-sensitive data augmentation.

Contribution

The paper introduces ResVG, which enhances relation and semantic understanding in visual grounding through semantic prior injection and relation-aware data augmentation.

Findings

01 Significant performance improvements on five datasets.

02 Enhanced understanding of fine-grained semantics and spatial relations.

03 Better localization accuracy in multi-instance scenarios.

Abstract

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce…
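To make the notion of "multiple-instance distractions" concrete, the following minimal sketch counts, for one annotated image, how many non-target objects share the target's category, and then groups samples by that count. The annotation layout (a list of category labels plus a target index) and the names `count_distractors` and `bucket_by_distractors` are assumptions made for illustration only; they are not taken from the paper or its repository.

```python
from collections import Counter
from typing import Dict, List, Union

# Hypothetical sample layout: {"categories": [...], "target": int}
Sample = Dict[str, Union[List[str], int]]


def count_distractors(categories: List[str], target_index: int) -> int:
    """Return how many non-target objects share the target's category."""
    target_category = categories[target_index]
    same_category = sum(1 for c in categories if c == target_category)
    return same_category - 1  # exclude the target itself


def bucket_by_distractors(samples: List[Sample]) -> Counter:
    """Count samples per number of same-category distractors."""
    buckets: Counter = Counter()
    for sample in samples:
        n = count_distractors(sample["categories"], sample["target"])
        buckets[n] += 1
    return buckets


samples = [
    {"categories": ["dog", "dog", "cat", "dog"], "target": 0},
    {"categories": ["cat", "sofa"], "target": 0},
    {"categories": ["cup", "cup"], "target": 1},
]
buckets = bucket_by_distractors(samples)
```

Grouping evaluation samples this way is one possible route to inspecting how accuracy changes as the number of distractors grows, which is the failure mode the abstract describes.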

Code & Models

Repositories

minghangz/resvg (PyTorch, official)
