TL;DR
This paper introduces C3VG, a two-stage multi-task visual grounding framework that enhances localization and segmentation accuracy by enforcing consistency constraints and leveraging pre-trained multimodal models.
Contribution
The paper proposes a novel coarse-to-fine architecture with explicit consistency constraints and multimodal pre-training to improve multi-task visual grounding performance.
Findings
Significantly outperforms state-of-the-art methods on RefCOCO, RefCOCO+, and RefCOCOg datasets.
Effectively enforces cross-task consistency through novel loss functions.
Leverages pre-trained vision-language models to address insufficient multimodal understanding.
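To make the idea of cross-task consistency concrete, the following minimal sketch measures how well a predicted box (the REC output) agrees with a predicted mask (the RIS output). It is an illustration only and is not the loss used in the paper: the function names `mask_to_box`, `box_iou`, and `consistency_score`, and the choice of comparing the box against the tight bounding box of the mask, are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical convention for this sketch: (x1, y1, x2, y2), inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight box around all nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def consistency_score(pred_box: Box, pred_mask: List[List[int]]) -> float:
    """Illustrative agreement measure: IoU between the predicted box and the
    tight box of the predicted mask. An assumption for this sketch, not the
    paper's consistency constraint."""
    mask_box = mask_to_box(pred_mask)
    if mask_box is None:
        return 0.0
    return box_iou(pred_box, mask_box)
```

A score near 1 means the two task outputs point at the same region, while a low score flags the kind of inconsistency between multi-task predictions that the paper aims to reduce. A trainable variant would need a differentiable formulation, which this sketch does not attempt.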
Abstract
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a…
