Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Yichen Yan; Xingjian He; Sihan Chen; Shichen Lu; Jing Liu

arXiv:2405.11205·cs.CV·May 21, 2024

Open Access

TL;DR

This paper introduces FCNet, a bi-directional vision-language framework for referring image segmentation that enhances multi-modal feature fusion through vision-guided initial fusion and language-guided calibration, leading to improved segmentation accuracy.

Contribution

The paper proposes a novel bi-directional guided fusion framework that jointly leverages vision and language for more accurate pixel-level segmentation in RIS.

Findings

01 Outperforms state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.

02 Effective multi-modal feature calibration improves segmentation quality.

03 Bi-directional guidance enhances fine-grained semantic understanding.

Abstract

Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being a text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these…
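
The two-stage idea in the abstract (vision-guided fusion first, language-guided calibration second) can be illustrated with a minimal NumPy sketch. This is not the FCNet implementation: the function names `cross_attention` and `fuse_and_calibrate`, the residual connections, the absence of learned projections, and the specific attention pattern in each stage are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    Each query row receives a weighted mix of the rows of keys_values.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values


def fuse_and_calibrate(vision, language):
    """Hypothetical two-stage fusion; both stages are assumed forms."""
    # Stage 1 (assumed "vision-guided" form): pixels act as queries
    # over the word tokens, so each pixel pulls in relevant language cues.
    fused = vision + cross_attention(vision, language)

    # Stage 2 (assumed "language-guided" form): word tokens first gather
    # context from the fused pixels, then the updated tokens are used to
    # re-weight the fused pixel features.
    lang_ctx = language + cross_attention(language, fused)
    calibrated = fused + cross_attention(fused, lang_ctx)
    return calibrated


rng = np.random.default_rng(0)
vision = rng.standard_normal((16, 8))   # 16 pixels (a flattened 4x4 map), 8 channels
language = rng.standard_normal((5, 8))  # 5 word tokens, 8 channels
out = fuse_and_calibrate(vision, language)
print(out.shape)
```

The output keeps one feature vector per pixel, which is the shape a segmentation decoder would need; a real model would add learned query, key, and value projections and operate on multi-scale features.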

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Focus