VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation

Roberto Alcover-Couso; Marcos Escudero-Viñolo; Juan C. SanMiguel; Jesus Bescos

arXiv:2412.09240·cs.CV·December 13, 2024

Open Access

TL;DR

This paper introduces UDA-FROVSS, a novel framework that combines vision-language reasoning with unsupervised domain adaptation techniques to improve open vocabulary segmentation across diverse, unseen domains without shared categories.

Contribution

The paper presents a method that integrates multi-scale context, prompt augmentation, and layer-wise fine-tuning within a UDA framework to improve fine-grained segmentation in VLMs.

Findings

01. Improved segmentation accuracy across multiple domains.

02. Effective adaptation without shared category labels.

03. Stable training through distillation and mixed sampling.

Abstract

Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these…
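
The abstract mentions "robust text embeddings with prompt augmentation" without giving details on this page. A common way to realize that idea is to embed a class name under several prompt templates and average the results; the sketch below illustrates only that general pattern and is not the authors' implementation. The templates, the `toy_text_encoder` stand-in (a hash-based placeholder for a real VLM text encoder), and the function names are all hypothetical.

```python
import hashlib
import math

# Hypothetical templates; the prompt set used in the paper is not given on this page.
TEMPLATES = [
    "a photo of a {}.",
    "a picture of a {} in a street scene.",
    "there is a {} in the image.",
]


def toy_text_encoder(text, dim=8):
    """Stand-in for a VLM text encoder (assumption): a deterministic
    pseudo-embedding derived from a SHA-256 hash of the prompt."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first `dim` bytes (0..255) to floats in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:dim]]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def augmented_class_embedding(class_name, encoder=toy_text_encoder,
                              templates=TEMPLATES):
    """Average the normalized embeddings of several prompts for one class,
    then renormalize the mean."""
    embeddings = [l2_normalize(encoder(t.format(class_name))) for t in templates]
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)


# Usage: one unit-norm text embedding per class name.
class_embedding = augmented_class_embedding("traffic light")
```

In a real pipeline, `encoder` would be replaced by the text tower of the chosen VLM, and the resulting class embeddings would be compared against per-pixel or per-region visual features.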

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications