SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

Ashfak Yeafi; Parthaw Goswami; Md Khairul Islam; Ashifa Islam Shamme

arXiv:2604.10000 · cs.CV · April 14, 2026

TL;DR

SwinTextUNet is a multimodal medical image segmentation model that integrates CLIP-based textual guidance with a Swin Transformer U-Net, improving accuracy and robustness in challenging imaging scenarios.

Contribution

This work introduces a multimodal framework that combines CLIP text embeddings with a Swin Transformer U-Net for medical image segmentation.

Findings

01

Achieved a Dice score of 86.47% and an IoU of 78.2% on the QaTaCOV19 dataset with the four-stage variant.

02

Demonstrated the effectiveness of text guidance and multimodal fusion.

03

Validated the model's robustness through ablation studies.
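
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Dice and IoU are commonly computed for a single pair of binary masks. The function name `dice_and_iou`, the flat 0/1 list representation, and the `eps` smoothing term are assumptions made for illustration; the paper's own evaluation code is not shown on this page and may differ (for example in thresholding or in how scores are averaged over images).

```python
def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - intersection

    # eps keeps both scores defined (and equal to 1) when both masks are empty.
    dice = (2 * intersection + eps) / (pred_sum + target_sum + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single mask pair the two scores are tied by IoU = Dice / (2 - Dice), but that identity does not generally hold for scores averaged over a dataset, so averaged Dice and IoU should be read as separate measurements.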

Abstract

Precise medical image segmentation is fundamental to computer-aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle with ambiguous or low-contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates textual embeddings derived from Contrastive Language-Image Pretraining (CLIP) into a Swin Transformer U-Net backbone. By integrating cross-attention and convolutional fusion, the model aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four-stage variant achieves the best balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies…
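
The abstract names cross-attention as one of the mechanisms that aligns text guidance with visual features. The sketch below illustrates the general idea only: visual tokens act as queries, text tokens act as keys and values, and the attended text context is added back to each visual token. It is a single-head, pure-Python toy with no learned projections; the function names, the residual addition, and the toy vectors are all assumptions for illustration. The actual SwinTextUNet fusion block (projection layers, number of heads, and the convolutional fusion step) is not specified in this abstract and is not reproduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual_tokens, text_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    visual_tokens: N vectors of size d, used as queries.
    text_tokens:   M vectors of size d, used as keys and values.
    Returns (fused, weights): fused is N x d, weights is N x M.
    """
    d = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in visual_tokens:
        # Similarity of this visual token to every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of the text tokens.
        context = [sum(a * v[j] for a, v in zip(attn, text_tokens)) for j in range(d)]
        # Assumed residual connection: keep the visual feature, add text context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs standing in for flattened image patches and text embeddings.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(visual, text)
for w in weights:
    print([round(x, 3) for x in w])
```

In a real model the text tokens would come from a text encoder such as CLIP's and would be projected to the channel width of the visual stage being fused; here both sides simply share the same toy dimension.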
