DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

Ziyu Zhao; Xiaoguang Li; Linjia Shi; Nasrin Imanpour; Song Wang

arXiv:2505.11676 · cs.CV · May 20, 2025

Open Access

TL;DR

DPSeg introduces a dual-prompt framework that enhances open-vocabulary semantic segmentation by reducing domain gaps and leveraging multi-level features, leading to superior performance on public datasets.

Contribution

The paper proposes a novel dual prompting framework with cost volume learning and semantic-guided prompt refinement for improved open-vocabulary segmentation.

Findings

01. Outperforms state-of-the-art methods on multiple datasets.

02. Effectively reduces domain gap between text and visual embeddings.

03. Enhances detection of small objects and fine details.

Abstract

Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By…
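The abstract names "dual-prompt cost volume generation" without spelling out the computation. As a rough illustration only, the sketch below uses a construction common in cost-volume-based segmentation: the cosine similarity between each pixel's feature vector and each class's prompt embedding. It builds one volume from text-prompt embeddings and one from visual-prompt embeddings, then stacks them. Everything here is an assumption made for illustration, not the authors' implementation: the function name `cosine_cost_volume`, the tensor shapes, the random stand-in features, and the final averaging step, which is a naive placeholder for the learned cost volume-guided decoder.

```python
import numpy as np


def cosine_cost_volume(pixel_feats, prompt_embeds, eps=1e-8):
    """Cosine similarity between every pixel feature and every class prompt.

    pixel_feats:   (H, W, D) dense image features (hypothetical shape)
    prompt_embeds: (C, D) one embedding per class (hypothetical shape)
    returns:       (H, W, C) cost volume
    """
    # L2-normalize along the feature dimension so the dot product is a cosine.
    f = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    p = prompt_embeds / (np.linalg.norm(prompt_embeds, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,cd->hwc", f, p)


# Random stand-ins for real encoder outputs (assumption: a shared D-dim space).
rng = np.random.default_rng(0)
H, W, D, C = 4, 4, 8, 3
pixel_feats = rng.normal(size=(H, W, D))
text_embeds = rng.normal(size=(C, D))    # stand-in for text prompt embeddings
visual_embeds = rng.normal(size=(C, D))  # stand-in for visual prompt embeddings

text_cost = cosine_cost_volume(pixel_feats, text_embeds)
visual_cost = cosine_cost_volume(pixel_feats, visual_embeds)

# Stack the two volumes along a new "prompt type" axis: (H, W, C, 2).
dual_cost = np.stack([text_cost, visual_cost], axis=-1)

# Naive placeholder for a learned decoder: average the two volumes and take
# the best-scoring class per pixel.
labels = dual_cost.mean(axis=-1).argmax(axis=-1)
print(dual_cost.shape, labels.shape)
```

In a real pipeline the stacked volume would be fed to a learned decoder rather than averaged, and the visual-prompt embeddings would come from whatever generation and refinement procedure the paper defines.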

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training