CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation
Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, and Kuang Gong

TL;DR
This paper introduces CDPDNet, a novel medical image segmentation framework that combines vision transformers, CLIP-based text embeddings, and task-specific prompts to improve segmentation accuracy and generalizability on partially labeled datasets.
Contribution
The study proposes a new CLIP-DINO prompt-driven segmentation network integrating vision transformers, text embeddings, and task prompts to address partial labels and complex anatomical relationships.
Findings
CDPDNet outperforms existing segmentation methods on multiple datasets.
It effectively models complex relationships between organs and tumors.
It generalizes better to unseen datasets.
Abstract
Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2…
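The partial-label problem described in the abstract can be made concrete with a small sketch. One common way to train on a mix of partially labeled datasets is to compute the loss only over the classes that a given dataset actually annotates, so that missing annotations are not treated as background. The abstract is truncated before it describes the training objective, so this is not necessarily what CDPDNet does. The function name `partial_label_bce`, the class names, and the flat per-voxel lists are all hypothetical choices made for illustration.

```python
import math


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def partial_label_bce(logits, targets, labeled_classes, eps=1e-7):
    """Mean binary cross-entropy over annotated classes only.

    logits, targets: dict mapping a class name to a list of per-voxel values.
    labeled_classes: class names annotated in the dataset this sample came
    from; every other class is skipped entirely (hypothetical masking scheme,
    not taken from the paper).
    """
    total, count = 0.0, 0
    for name in labeled_classes:
        for z, y in zip(logits[name], targets[name]):
            # Clamp the probability so log() never receives 0.
            p = min(max(stable_sigmoid(z), eps), 1.0 - eps)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count if count else 0.0


# A sample from a dataset that annotates only the liver.
logits = {"liver": [0.0, 0.0], "kidney": [8.0, -8.0]}
targets = {"liver": [1, 0], "kidney": [0, 0]}  # kidney targets are placeholders
loss = partial_label_bce(logits, targets, labeled_classes=["liver"])
```

In this sketch the kidney predictions never enter the loss, so a confident kidney prediction is not penalized merely because the source dataset did not annotate kidneys.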
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · AI in cancer detection · Medical Imaging and Analysis
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Concatenated Skip Connection · Dense Connections · Vision Transformer
