TL;DR
ResVG is a novel model that improves visual grounding accuracy in complex scenes with multiple similar objects by integrating semantic priors and relation-sensitive data augmentation.
Contribution
The paper introduces ResVG, which enhances relation and semantic understanding in visual grounding through semantic prior injection and relation-aware data augmentation.
Findings
Significant performance improvements on five datasets.
Enhanced understanding of fine-grained semantics and spatial relations.
Better localization accuracy in multi-instance scenarios.
Abstract
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce…
