Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai; Jian Li; Jiedong Zhuang; Xian Zhang; Wankou Yang

arXiv:2501.06710·cs.CV·January 14, 2025

TL;DR

This paper introduces C3VG, a two-stage multi-task visual grounding framework that enhances localization and segmentation accuracy by enforcing consistency constraints and leveraging pre-trained multimodal models.

Contribution

The paper proposes a novel coarse-to-fine architecture with explicit consistency constraints and multimodal pre-training to improve multi-task visual grounding performance.

Findings

01. Significantly outperforms state-of-the-art methods on RefCOCO, RefCOCO+, and RefCOCOg datasets.

02. Effectively enforces cross-task consistency through novel loss functions.

03. Leverages pre-trained visual-linguistic models to address understanding limitations.

Abstract

Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($C^{3}$VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a…
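To make the notion of inconsistency between REC and RIS predictions concrete, the following is a minimal sketch of one possible diagnostic: take the tightest box around a predicted mask and compare it with the predicted box by IoU. This is an illustrative assumption, not the consistency loss used in the paper; the function names (`mask_to_box`, `box_iou`, `box_mask_consistency`) and the inclusive pixel-coordinate box convention are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical convention: (x1, y1, x2, y2) in inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box enclosing all nonzero mask pixels, or None if the mask is empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive-coordinate pixel boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    inter_h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(inter_w, 0) * max(inter_h, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def box_mask_consistency(pred_box: Box, pred_mask: List[List[int]]) -> float:
    """Score in [0, 1]: IoU between the predicted box and the box implied by the predicted mask.

    An empty mask is treated as fully inconsistent (score 0.0); this is an
    assumption of the sketch, not something stated in the source.
    """
    mask_box = mask_to_box(pred_mask)
    if mask_box is None:
        return 0.0
    return box_iou(pred_box, mask_box)


# Toy 4x4 mask whose foreground occupies the central 2x2 block.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
agree = box_mask_consistency((1, 1, 2, 2), toy_mask)     # box matches the mask extent
disagree = box_mask_consistency((0, 0, 3, 3), toy_mask)  # box is much larger than the mask
```

A score near 1 indicates that the detection and segmentation outputs agree on the target's extent; a low score flags the kind of cross-task disagreement the abstract describes. A differentiable training objective would need a soft formulation rather than this hard min/max over pixels.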

Code & Models

Repositories

dmmm1997/c3vg (PyTorch, official)

Videos

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints (Underline)