DOTA: Distributional Test-Time Adaptation of Vision-Language Models
Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang

TL;DR
DOTA introduces a distribution-based test-time adaptation method for vision-language models that continuously estimates test data distributions to improve robustness and reduce catastrophic forgetting during deployment.
Contribution
It proposes a novel distribution-centric approach for test-time adaptation, addressing cache management limitations and enhancing model robustness in deployment.
Findings
DOTA outperforms existing test-time adaptation methods.
It significantly reduces catastrophic forgetting.
It achieves state-of-the-art performance on benchmark datasets.
Abstract
Vision-language foundation models (VLMs), such as CLIP, exhibit remarkable performance across a wide range of tasks. However, deploying these models can be unreliable when significant distribution gaps exist between training and test data, while fine-tuning for diverse scenarios is often costly. Cache-based test-time adapters offer an efficient alternative by storing representative test samples to guide subsequent classifications. Yet, these methods typically employ naive cache management with limited capacity, leading to severe catastrophic forgetting when samples are inevitably dropped during updates. In this paper, we propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. Crucially, instead of merely memorizing individual test samples, DOTA continuously estimates the underlying distribution of the test data stream. Test-time…
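As a rough illustration of what "estimating the distribution of the test stream" can mean in contrast to keeping a fixed-size cache, the sketch below maintains running per-class feature statistics that are updated once per test sample. It is not the authors' implementation: the Gaussian form with diagonal variance, the use of zero-shot probabilities as soft weights, and every name in it (`StreamingClassGaussians`, `update`, `log_likelihood`, `eps`) are assumptions made for this example, since the abstract above is truncated before the method details.

```python
import numpy as np


class StreamingClassGaussians:
    """Hypothetical helper: running per-class mean and diagonal variance.

    Memory use is fixed by (num_classes, dim) and does not grow with the
    stream, so no sample ever has to be evicted as in a bounded cache.
    """

    def __init__(self, num_classes, dim, eps=1e-3):
        self.weight = np.zeros(num_classes)        # soft sample count per class
        self.mean = np.zeros((num_classes, dim))   # running class means
        self.m2 = np.zeros((num_classes, dim))     # weighted sum of squared deviations
        self.eps = eps                             # variance floor (assumed value)

    def update(self, feature, probs):
        """Fold one test feature into the statistics.

        `probs` is a per-class weight vector, e.g. zero-shot probabilities
        used as a soft pseudo-label (an assumption of this sketch).
        """
        for k, w in enumerate(probs):
            if w <= 0:
                continue
            # Weighted incremental (Welford/West) update of mean and M2.
            self.weight[k] += w
            delta = feature - self.mean[k]
            self.mean[k] += (w / self.weight[k]) * delta
            self.m2[k] += w * delta * (feature - self.mean[k])

    def log_likelihood(self, feature):
        """Per-class Gaussian log-likelihood, up to an additive constant."""
        var = self.m2 / np.maximum(self.weight, 1e-12)[:, None] + self.eps
        diff = feature - self.mean
        return -0.5 * np.sum(diff ** 2 / var + np.log(var), axis=1)


# Usage: two samples assigned entirely to class 0.
est = StreamingClassGaussians(num_classes=2, dim=2)
est.update(np.array([0.0, 0.0]), np.array([1.0, 0.0]))
est.update(np.array([2.0, 2.0]), np.array([1.0, 0.0]))
print(est.mean[0])  # running mean of the class-0 features
```

In a sketch like this, the resulting per-class scores could be combined with the model's zero-shot logits. How DOTA actually combines the two is not stated in the portion of the abstract shown here.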
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Adapter · Contrastive Language-Image Pre-training
