Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, and Aggelos Katsaggelos

arXiv:2511.19759 · cs.CV · November 27, 2025

Open Access

TL;DR

This paper introduces VESSA, a vision-language enhanced foundation model that improves semi-supervised medical image segmentation by leveraging visual-semantic understanding and iterative pseudo-label refinement, significantly boosting accuracy with limited annotations.

Contribution

The work presents a novel VLM-based segmentation foundation model integrated into SSL, enabling effective semantic feature matching and iterative pseudo-label refinement for medical image segmentation.

Findings

01

VESSA outperforms existing methods on multiple datasets.

02

Significant accuracy improvements under limited annotations.

03

Effective integration of vision-language models into SSL frameworks.

Abstract

Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching…
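As a rough illustration of what reference-guided matching against a template bank can look like, the sketch below labels each input feature vector with the label of its most similar template feature, using cosine similarity. This is only a toy under stated assumptions, not VESSA's architecture: the abstract does not specify how features are extracted or matched, and the names `cosine`, `transfer_labels`, `template_feats`, and `template_labels` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_labels(input_feats, template_feats, template_labels):
    """Assign each input feature the label of its nearest template feature.

    Hypothetical helper: nearest-neighbour label transfer is an assumed
    stand-in for the matching step, which the abstract does not detail.
    """
    predicted = []
    for feat in input_feats:
        sims = [cosine(feat, t) for t in template_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        predicted.append(template_labels[best])
    return predicted


# Toy template bank: two exemplar features with known class labels.
template_feats = [[1.0, 0.0], [0.0, 1.0]]
template_labels = [0, 1]

# Toy per-pixel (or per-patch) features from an unlabeled input image.
input_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

predicted = transfer_labels(input_feats, template_feats, template_labels)
print(predicted)
```

In a semi-supervised setting, labels transferred this way could serve as initial pseudo-labels for unlabeled images; how VESSA actually produces and refines them is described in the paper itself.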

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning