SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation
Ashfak Yeafi, Parthaw Goswami, Md Khairul Islam, Ashifa Islam Shamme

TL;DR
SwinTextUNet is a multimodal medical image segmentation model that integrates CLIP-based textual guidance with a Swin Transformer U-Net, improving accuracy and robustness in challenging imaging scenarios.
Contribution
This work introduces a novel multimodal framework that combines CLIP text embeddings with a Swin Transformer U-Net for enhanced medical image segmentation.
Findings
Achieved a Dice score of 86.47% on the QaTaCOV19 dataset.
Demonstrated the effectiveness of text guidance and multimodal fusion.
Validated the model's robustness through ablation studies.
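For readers unfamiliar with the metrics reported here, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. It is not taken from the paper: the function name `dice_and_iou`, the smoothing term `eps`, and the toy masks are illustrative assumptions, and the authors' evaluation code may differ (for example in thresholding or per-image averaging).

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Return (dice, iou) for two binary masks of the same shape.

    `eps` is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    intersection = np.logical_and(pred, target).sum()
    pred_area = pred.sum()
    target_area = target.sum()
    union = pred_area + target_area - intersection

    dice = (2.0 * intersection + eps) / (pred_area + target_area + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)

# Toy example (hypothetical masks): one overlapping pixel,
# two predicted pixels, one ground-truth pixel.
example_pred = np.array([[1, 1], [0, 0]])
example_target = np.array([[1, 0], [0, 0]])
example_dice, example_iou = dice_and_iou(example_pred, example_target)
```

In the toy example the overlap is one pixel, so Dice is 2·1/(2+1) and IoU is 1/2, up to the small smoothing term.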
Abstract
Precise medical image segmentation is fundamental for enabling computer-aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low-contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language-Image Pretraining (CLIP)-derived textual embeddings into a Swin Transformer UNet backbone. By integrating cross-attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four-stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies…
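The abstract mentions cross-attention between text guidance and visual representations without giving details. As a rough illustration only, the sketch below shows generic single-head cross-attention in NumPy, with flattened visual tokens as queries and text tokens as keys and values. Everything specific here is an assumption rather than the paper's design: the function names, the dimensions, the random projection matrices, and the residual connection are hypothetical, and the convolutional fusion step is not covered.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(visual_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, not the paper's module).

    visual_tokens: (N, C) flattened image features, used as queries.
    text_tokens:   (T, D) text embeddings, used as keys and values.
    w_q: (C, C), w_k: (D, C), w_v: (D, C) hypothetical projections.
    """
    q = visual_tokens @ w_q                      # (N, C)
    k = text_tokens @ w_k                        # (T, C)
    v = text_tokens @ w_v                        # (T, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (N, T)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    attended = weights @ v                       # (N, C)
    # Assumed residual connection: text-informed update added to the visual tokens.
    return visual_tokens + attended, weights

# Hypothetical sizes: 16 visual tokens with 8 channels, 4 text tokens with 12 dims.
rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))
text = rng.standard_normal((4, 12))
w_q = rng.standard_normal((8, 8))
w_k = rng.standard_normal((12, 8))
w_v = rng.standard_normal((12, 8))

fused, attn = cross_attention(visual, text, w_q, w_k, w_v)
```

Each visual token receives a weighted mix of the text tokens, so the fused output keeps the visual token layout while carrying text-derived information.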
