CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, and Kuang Gong

arXiv:2505.18958 · cs.CV · May 28, 2025

TL;DR

This paper introduces CDPDNet, a novel medical image segmentation framework that combines vision transformers, CLIP-based text embeddings, and task-specific prompts to improve segmentation accuracy and generalizability on partially labeled datasets.

Contribution

The study proposes a new CLIP-DINO prompt-driven segmentation network integrating vision transformers, text embeddings, and task prompts to address partial labels and complex anatomical relationships.

Findings

1. Outperforms existing segmentation methods on multiple datasets.
2. Effectively models complex organ and tumor relationships.
3. Enhances generalization to unseen datasets.

Abstract

Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2…

Code & Models

Repositories

wujiong-hub/cdpdnet (PyTorch, official)

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · AI in cancer detection · Medical Imaging and Analysis

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Concatenated Skip Connection · Dense Connections · Vision Transformer