Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation
Yichen Yan, Xingjian He, Sihan Chen, Shichen Lu, Jing Liu

TL;DR
This paper introduces FCNet, a bi-directional vision-language framework for referring image segmentation that enhances multi-modal feature fusion through vision-guided initial fusion and language-guided calibration, leading to improved segmentation accuracy.
Contribution
The paper proposes a novel bi-directional guided fusion framework that jointly leverages vision and language for more accurate pixel-level segmentation in RIS.
Findings
FCNet outperforms state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.
Effective multi-modal feature calibration improves segmentation quality.
Bi-directional guidance enhances fine-grained semantic understanding.
Abstract
Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being a text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these…
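The two stages named in the abstract, vision-guided initial fusion followed by language-guided calibration, can be illustrated with a small dependency-free sketch. This is not FCNet's implementation: the abstract does not specify the layers, so the function names (`vision_guided_fusion`, `language_guided_calibration`), the use of plain dot-product cross-attention, the residual add, and the sigmoid gate are all assumptions chosen only to show the direction of guidance in each stage. Pixel features and word features are assumed to share one embedding size.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row becomes a weighted mix of context rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[k] for w, c in zip(weights, context)) for k in range(d)])
    return out


def vision_guided_fusion(vis, lang):
    """ASSUMED stage 1: pixels act as queries over the words, added back residually."""
    attended = cross_attention(vis, lang)
    return [[v + a for v, a in zip(v_row, a_row)] for v_row, a_row in zip(vis, attended)]


def language_guided_calibration(fused, lang):
    """ASSUMED stage 2: words query the fused pixels, then gate each pixel feature."""
    d = len(lang[0])
    grounded = cross_attention(lang, fused)  # word features grounded in the image
    sentence = [sum(w[k] for w in grounded) / len(grounded) for k in range(d)]
    calibrated = []
    for pixel in fused:
        score = sum(p * s for p, s in zip(pixel, sentence)) / math.sqrt(d)
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
        calibrated.append([gate * p for p in pixel])
    return calibrated


if __name__ == "__main__":
    # Toy inputs: 6 "pixels" and 3 "words", each a 4-dimensional feature vector.
    vis = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
    lang = [[0.2 * (i - j) for j in range(4)] for i in range(3)]
    fused = vision_guided_fusion(vis, lang)
    calibrated = language_guided_calibration(fused, lang)
    print(len(calibrated), len(calibrated[0]))
```

In this sketch the first stage lets visual features decide which words matter at each location, and the second stage lets the sentence re-weight the fused features, which is one simple way to read "both vision and language play guiding roles". A real model would use learned projections and multi-scale feature maps rather than raw dot products.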
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Focus
