Efficient Open Set Single Image Test Time Adaptation of Vision Language Models

Manogna Sreenivas; Soma Biswas

arXiv:2406.00481 · cs.CV · June 3, 2025



TL;DR

This paper introduces ROSITA, a novel framework for open-set single-image test-time adaptation of vision-language models, enabling models to adapt to new environments and distinguish known from unknown classes in real-time scenarios.

Contribution

The paper establishes a comprehensive benchmark for open-set TTA and proposes ROSITA, which uses dynamic feature banks and contrastive learning to improve open-set adaptation performance.

Findings

1. ROSITA achieves state-of-the-art results on real-world benchmarks.
2. The method effectively distinguishes known and unknown classes in real-time.
3. ROSITA demonstrates computational efficiency suitable for deployment.

Abstract

Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test-Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Current Test-Time Adaptation methods operate under closed-set assumptions or rely on batch processing, both of which differ from real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for Open-set Single-image Test-Time Adaptation using Vision-Language Models. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the…
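
The abstract names two ingredients: feature banks that are updated as test samples arrive, and a contrastive objective. The sketch below illustrates those two ideas in a generic form using NumPy only. It is not the authors' implementation: the class `FeatureBank`, the functions `known_score`, `route_sample`, and `contrastive_loss`, the fixed `threshold`, the FIFO update rule, and the InfoNCE-style loss are all assumptions chosen for illustration, and the paper's actual scoring, reliability criterion, and loss may differ.

```python
from collections import deque

import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


class FeatureBank:
    """Fixed-size FIFO store of unit-norm features (hypothetical helper)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entry is dropped when full

    def add(self, feature):
        self.items.append(l2_normalize(feature))

    def as_array(self):
        return np.stack(list(self.items))

    def __len__(self):
        return len(self.items)


def known_score(feature, class_prototypes):
    """Max cosine similarity between a feature and any known-class prototype."""
    sims = l2_normalize(class_prototypes) @ l2_normalize(feature)
    return float(np.max(sims))


def route_sample(feature, class_prototypes, known_bank, unknown_bank, threshold):
    """Store the sample in the known or unknown bank.

    Assumption: a fixed similarity threshold decides the split.
    """
    score = known_score(feature, class_prototypes)
    is_known = score >= threshold
    (known_bank if is_known else unknown_bank).add(feature)
    return is_known, score


def contrastive_loss(feature, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the feature is near positives, far from negatives."""
    f = l2_normalize(feature)
    pos = np.exp(positives @ f / temperature).sum()
    neg = np.exp(negatives @ f / temperature).sum()
    return float(-np.log(pos / (pos + neg)))


# Usage: two known-class prototypes in a toy 3-D feature space.
prototypes = np.eye(3)[:2]
known_bank, unknown_bank = FeatureBank(capacity=4), FeatureBank(capacity=4)
for sample in ([1.0, 0.1, 0.0], [0.0, 0.0, 1.0]):
    route_sample(sample, prototypes, known_bank, unknown_bank, threshold=0.8)
loss = contrastive_loss([1.0, 0.0, 0.0], known_bank.as_array(), unknown_bank.as_array())
```

In a vision-language setting the prototypes would typically be text embeddings of the known class names and the features would be image embeddings, with the loss driving a small set of trainable parameters. Those details are left out here because the truncated abstract does not specify them.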


Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · Contrastive Learning