Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

Joel Valdivia Ortega; Tingying Peng; Marion Jasnin

arXiv:2605.16393 · cs.CV · May 19, 2026

TL;DR

This paper introduces ViTC-UNet, a model that conditions a UNet on a frozen pre-trained Vision Transformer (ViT). It improves biomedical semantic segmentation by combining the ViT's global visual priors with the UNet's local detail, without fine-tuning the ViT.

Contribution

The paper proposes a novel structure-conditioned UNet that uses frozen pre-trained ViT representations, enhancing segmentation accuracy in biomedical imaging without end-to-end ViT fine-tuning.

Findings

01 ViTC-UNet outperforms baseline models in MRI and CT segmentation tasks.

02 Combining ViT priors with a UNet improves the precision of biomedical segmentation masks.

03 The approach is effective across different imaging modalities.

Abstract

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities,…
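
The conditioning idea in the abstract (learnable tokens exchanging information with frozen ViT patch embeddings through two-way attention, then steering UNet features) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names, the omission of learned projections, the single attention head, and the channel-wise scaling in `condition_features` are all assumptions made for illustration, and the random lists only stand in for real ViT outputs and UNet feature maps.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention.

    Learned query/key/value projections are omitted for brevity, so the
    context vectors serve as both keys and values (an assumption).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in context])
        attended.append(
            [sum(w * v[c] for w, v in zip(weights, context)) for c in range(dim)]
        )
    return attended


def residual_add(x, update):
    """Return x + update as new lists, leaving the inputs untouched."""
    return [[a + b for a, b in zip(row, upd)] for row, upd in zip(x, update)]


def two_way_block(tokens, patches):
    """One two-way attention step between tokens and ViT patch embeddings."""
    # Direction 1: tokens query the frozen ViT patch embeddings.
    tokens = residual_add(tokens, cross_attention(tokens, patches))
    # Direction 2: patch embeddings query the updated tokens.
    patches = residual_add(patches, cross_attention(patches, tokens))
    return tokens, patches


def condition_features(feature_map, token):
    """Hypothetical conditioning: scale each UNet channel by 1 + tanh(token[c])."""
    return [
        [(1.0 + math.tanh(token[c])) * value for value in channel]
        for c, channel in enumerate(feature_map)
    ]


random.seed(0)
dim = 4
# Stand-in for frozen ViT output: 6 patch embeddings of width dim.
vit_patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
# Stand-in for learnable tokens: 2 tokens of width dim.
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
# Stand-in for a UNet feature map: dim channels, 9 pixels each.
unet_features = [[random.gauss(0, 1) for _ in range(dim and 9)] for _ in range(dim)]

tokens, fused_patches = two_way_block(tokens, vit_patches)
conditioned = condition_features(unet_features, tokens[0])
```

Because `two_way_block` returns new lists rather than modifying its inputs, the original `vit_patches` stay unchanged, which loosely mirrors the frozen-ViT setting: in a trainable version of this sketch, only the tokens and the UNet side would receive updates.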
