Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation
Brunó B. Englert, Fabrizio J. Piva, Tommie Kerssies, Daan de Geus, Gijs Dubbelman

TL;DR
This paper shows that combining Vision Foundation Models (VFMs) with Unsupervised Domain Adaptation (UDA) improves both semantic segmentation accuracy and inference speed over previous UDA methods.
Contribution
The paper introduces VFM-UDA, a method that integrates VFMs with UDA. It improves UDA accuracy while maintaining the out-of-distribution performance of VFMs, and it makes certain time-consuming UDA components redundant, which speeds up inference.
Findings
8.4× speedup over previous methods
+1.2 mIoU improvement in UDA performance
+6.1 mIoU in out-of-distribution generalization
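The findings above are reported in mIoU (mean intersection-over-union), the standard semantic segmentation metric, usually stated as a percentage. As a minimal sketch of how such a score is commonly computed, and not the paper's evaluation code, the function below averages per-class IoU over flattened label lists. The name `mean_iou` and the `ignore_index=255` default are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: flat sequences of integer class labels, one per pixel.
    ignore_index: ground-truth label to skip (assumed convention, not from the paper).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            # Correct pixel counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Two classes, four pixels: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Under this definition, a gain such as +1.2 mIoU would be read as 1.2 percentage points when scores are expressed on a 0 to 100 scale.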
Abstract
Achieving robust generalization across diverse data domains remains a significant challenge in computer vision. This challenge is important in safety-critical applications, where deep-neural-network-based systems must perform reliably under various environmental conditions not seen during training. Our study investigates whether the generalization capabilities of Vision Foundation Models (VFMs) and Unsupervised Domain Adaptation (UDA) methods for the semantic segmentation task are complementary. Results show that combining VFMs with UDA has two main benefits: (a) it allows for better UDA performance while maintaining the out-of-distribution performance of VFMs, and (b) it makes certain time-consuming UDA components redundant, thus enabling significant inference speedups. Specifically, with equivalent model sizes, the resulting VFM-UDA method achieves an 8.4× speed increase over…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
