VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation
Roberto Alcover-Couso, Marcos Escudero-Vi\~nolo, Juan C. SanMiguel and, Jesus Bescos

TL;DR
This paper introduces UDA-FROVSS, a novel framework that combines vision-language reasoning with unsupervised domain adaptation techniques to improve open vocabulary segmentation across diverse, unseen domains without shared categories.
Contribution
It presents a new method integrating multi-scale context, prompt augmentation, and layer-wise fine-tuning within a UDA framework to enhance fine-grained segmentation in VLMs.
Findings
Improved segmentation accuracy across multiple domains.
Effective adaptation without shared category labels.
Stable training through distillation and mixed sampling.
Abstract
Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
