Efficient Open Set Single Image Test Time Adaptation of Vision Language Models
Manogna Sreenivas, Soma Biswas

TL;DR
This paper introduces ROSITA, a novel framework for open-set single-image test-time adaptation of vision-language models, enabling models to adapt to new environments and distinguish known from unknown classes in real-time scenarios.
Contribution
The paper establishes a comprehensive benchmark for open-set TTA and proposes ROSITA, which uses dynamic feature banks and contrastive learning to improve open-set adaptation performance.
Findings
ROSITA achieves state-of-the-art results on real-world benchmarks.
The method effectively distinguishes known and unknown classes in real-time.
ROSITA demonstrates computational efficiency suitable for deployment.
Abstract
Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test-Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Current Test-Time Adaptation methods operate under closed-set assumptions or batch processing, differing from the real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for Open-set Single-image Test-Time Adaptation using Vision-Language Models. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the…
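To make the setting concrete, the following is a minimal sketch of open-set, single-image processing with bounded feature banks. It is not the ROSITA implementation: the scoring rule (maximum cosine similarity to class text embeddings, as in CLIP-style zero-shot classification), the fixed threshold, and the names FeatureBank and process_sample are all assumptions made for illustration, and the contrastive objective mentioned in the abstract is not implemented here.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class FeatureBank:
    """Hypothetical fixed-size FIFO store of recent test features."""

    def __init__(self, max_size=64):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.items = deque(maxlen=max_size)

    def add(self, feature):
        self.items.append(feature)

    def __len__(self):
        return len(self.items)


def process_sample(image_feat, class_feats, known_bank, unknown_bank,
                   threshold=0.8):
    """Handle one test sample arriving on its own (no batch).

    Returns (class_index, score) for a sample judged to be a known class,
    or (None, score) for a sample judged to be unknown. The threshold is
    an assumed constant, not a value taken from the paper.
    """
    sims = [cosine(image_feat, c) for c in class_feats]
    best = max(range(len(sims)), key=sims.__getitem__)
    score = sims[best]
    if score >= threshold:
        known_bank.add(image_feat)    # candidate reliable known-class sample
        return best, score
    unknown_bank.add(image_feat)      # candidate unknown-class sample
    return None, score
```

In a full method, the contents of the two banks would feed an adaptation loss after each sample; here they are only collected, which is enough to show how per-sample decisions and bounded memory fit together in a streaming loop.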
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
