DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction
Jing Wang, Huimin Shi, Quan Zhou, Qibo Liu, Suofei Zhang, Huimin Lu

TL;DR
DCP-CLIP introduces a coarse-to-fine framework for open-vocabulary semantic segmentation that dynamically constructs textual features and models dual interactions, improving accuracy and efficiency over existing methods.
Contribution
The paper proposes a framework for open-vocabulary semantic segmentation (OVSS) that combines dynamic textual feature construction with dual interaction modeling, targeting insufficient cross-modal communication and high computational cost.
Findings
Outperforms existing OVSS methods in accuracy
Achieves higher efficiency in semantic segmentation
Demonstrates effectiveness on multiple benchmarks
Abstract
Recent years have witnessed remarkable progress in open-vocabulary semantic segmentation (OVSS) using visual-language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between the textual and visual spaces, and (2) significant computational cost from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories…
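The abstract is cut off before it describes the individual stages, so the snippet below is only a generic illustration of how an image-level category pre-filter could shrink the set of categories that a later, finer interaction step has to handle. It is not the authors' method. The function names, the cosine-similarity scoring, the top-k rule, and the toy vectors (which stand in for CLIP image and text embeddings) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_candidate_categories(
    image_embedding: Sequence[float],
    text_embeddings: Dict[str, Sequence[float]],
    top_k: int,
) -> List[str]:
    """Hypothetical coarse stage: keep the top_k categories closest to the image.

    The top-k rule is an assumption; the source does not state how
    categories are selected.
    """
    scores = {
        name: cosine(image_embedding, emb)
        for name, emb in text_embeddings.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Toy embeddings with made-up values, standing in for CLIP outputs.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
    "tree": [0.0, 1.0, 0.0],
}

candidates = select_candidate_categories(image_embedding, text_embeddings, top_k=2)
print(candidates)  # ['cat', 'dog']
```

With a pre-filter of this kind, any subsequent per-pixel computation would run over only the retained categories (two here) rather than the full vocabulary (four here), which is one plausible way to address the computational cost the abstract mentions.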
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
