Cross-Modal Progressive Comprehension for Referring Segmentation
Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li

TL;DR
This paper introduces a progressive cross-modal comprehension scheme for referring segmentation. It mimics human reasoning by first locating candidate entities and then using the relations among them to single out the target, and reports state-of-the-art results on referring image and video segmentation.
Contribution
The paper proposes a Cross-Modal Progressive Comprehension (CMPC) scheme, implemented as a CMPC-I module for images and a CMPC-V module for videos, to improve cross-modal feature interaction and reasoning in referring segmentation models.
Findings
Achieves new state-of-the-art results on four referring image segmentation benchmarks.
Achieves new state-of-the-art results on three referring video segmentation benchmarks.
Effectively models human-like progressive reasoning in multimodal understanding.
Abstract
Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem in a progressive manner based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related…
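The coarse-to-fine idea described in the abstract (first roughly locate candidate entities, then distinguish the target) can be illustrated with a toy sketch. This is not the CMPC-I or CMPC-V module: the function name `progressive_mask`, the hand-made three-dimensional features, the cosine scoring, and the threshold are all assumptions chosen only to make the two stages visible.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def progressive_mask(pixel_feats, entity_query, relation_query, entity_thresh=0.5):
    """Hypothetical two-stage selection over a flattened list of pixel features.

    Stage 1 keeps every pixel that matches the entity/attribute cue.
    Stage 2 keeps, among those candidates, the ones that best match the relational cue.
    Returns a binary mask with one entry per pixel.
    """
    # Stage 1: roughly locate all candidate entities.
    candidates = [
        i for i, feat in enumerate(pixel_feats)
        if cosine(feat, entity_query) > entity_thresh
    ]
    if not candidates:
        return [0] * len(pixel_feats)

    # Stage 2: distinguish the target among the candidates using the relational cue.
    rel_scores = {i: cosine(pixel_feats[i], relation_query) for i in candidates}
    best = max(rel_scores.values())
    return [
        1 if i in rel_scores and rel_scores[i] == best else 0
        for i in range(len(pixel_feats))
    ]


# Assumed toy features: [person-ness, left-ness, background-ness].
pixels = [
    [1.0, 1.0, 0.0],  # person on the left
    [1.0, 0.0, 0.0],  # person on the right
    [0.0, 1.0, 1.0],  # background on the left
    [0.0, 0.0, 1.0],  # background on the right
]
entity_query = [1.0, 0.0, 0.0]    # stands in for the word "person"
relation_query = [0.0, 1.0, 0.0]  # stands in for the phrase "on the left"

mask = progressive_mask(pixels, entity_query, relation_query)
print(mask)  # [1, 0, 0, 0]
```

In this toy setting, the entity cue alone keeps both people, and only the second stage separates the person on the left from the one on the right. A one-stage scheme that fused both cues at once would have to resolve both questions in a single score.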
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
