Flatness Guided Test-Time Adaptation for Vision-Language Models
Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, Shafei Wang

TL;DR
This paper introduces a flatness-guided test-time adaptation framework for vision-language models that uses the flat minima found during training to make adaptation under distribution shift cheaper and more accurate.
Contribution
The paper proposes the Flatness-Guided Adaptation (FGA) framework, which unifies training and test-time adaptation by using flatness cues from sharpness-aware training, reducing computational cost at test time.
Findings
FGA outperforms existing TTA methods on multiple benchmarks.
FGA achieves a 4.88% average improvement on ImageNet out-of-domain variants.
FGA avoids expensive prompt updates during test time.
Abstract
Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts at test time. Recent research indicates that test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrades their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Building on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of…
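To make the notions of "sharpness-aware training" and "flatness" in the abstract concrete, here is a minimal sketch of the generic sharpness-aware minimization (SAM) update and a forward-only sharpness probe. It is not the paper's SAPT or STSS procedure: the toy quadratic loss, the function names (`loss`, `grad`, `sam_step`, `sharpness`), and the values of `rho` and `lr` are all illustrative assumptions.

```python
import numpy as np

def loss(w, A):
    # Toy quadratic loss 0.5 * w^T A w (assumption: stands in for a real training loss).
    return 0.5 * w @ A @ w

def grad(w, A):
    # Gradient of the toy loss; valid because A is assumed symmetric.
    return A @ w

def sam_step(w, A, lr=0.1, rho=0.05):
    # Generic SAM update: move to the approximate worst-case point within
    # an L2 ball of radius rho, then descend using the gradient taken there.
    g = grad(w, A)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_perturbed = grad(w + eps, A)
    return w - lr * g_perturbed

def sharpness(w, A, rho=0.05, n_dirs=64, seed=0):
    # Forward-only sharpness probe (no gradients): the largest loss increase
    # over n_dirs random perturbations of norm rho around w.
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, w.size))
    dirs *= rho / np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(loss(w + d, A) for d in dirs) - loss(w, A)

if __name__ == "__main__":
    A_flat = np.diag([1.0, 1.0])    # low curvature in every direction
    A_sharp = np.diag([1.0, 10.0])  # high curvature along the second axis
    w = np.array([1.0, 1.0])
    print("loss before SAM step:", loss(w, A_sharp))
    print("loss after SAM step: ", loss(sam_step(w, A_sharp), A_sharp))
    w_min = np.zeros(2)
    print("sharpness at flat minimum: ", sharpness(w_min, A_flat))
    print("sharpness at sharp minimum:", sharpness(w_min, A_sharp))
```

Both toy losses have their minimum at the origin, but the probe reports a larger value for the high-curvature one. That is the sense in which a minimum is "flat" or "sharp"; how FGA actually measures and uses flatness at test time is specified in the paper, not here.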
Peer Reviews
Decision: ICLR 2026 Poster
1. To my knowledge, this is the first work tackling TTA (and prompt tuning) for vision-language models from the point of view of flat minima. This perspective on the problem is sound, supported by both findings of previous works (e.g., Niu et al. 2023, Gong et al. 2023) as well as the theoretical analysis of Sec. 4, and the achieved experimental results. The paper confirms the potential of this strategy, proposing variants suitable for the increased representation capabilities of VLMs. 2. The s
**1.** While FGA is an interesting approach, its two components could be techniques applied independently for prompt-tuning (SAPT) and for sample selection (STSS). Specifically, these two approaches might be treated independently from each other, as SAPT is a strategy for prompt tuning while STSS is a strategy for TTA with sample selection. Thus: **1.1** SAPT could be, in principle, compared with CoOp and related variants on prompt-tuning benchmarks (e.g., base vs novel categ
1. The paper introduces an FGA method that cohesively links training and test-time adaptation through the concept of loss landscape flatness, improving robustness to distribution shifts without updating model parameters during inference. 2. FGA avoids backpropagation and prompt updates at test time, resulting in lower computational overhead while still improving performance.
1. The scope of the paper is mismatched: while the title refers to test-time adaptation (TTA) for "vision-language models" broadly, all experiments are limited to CLIP variants (ViT-B/16 and ResNet50). This narrow scope excludes newer or structurally different VLMs such as LLaVA or CLIP variants such as SigLIP, SigLIP-v2, or other similar models, undermining the generality claim. 2. The phrase "training data on downstream task" is vague. From the experimental section, it seems the prompt tun
- **S1.** The paper proposes a novel TTA paradigm that eliminates the need for backpropagation during testing. This is innovative. - **S2.** The paper theoretically discusses the relationship between sharpness and generalization performance on the loss landscape, justifying the effectiveness of the proposed method. If this analysis is novel, it is expected to have a significant impact on the field. - **S3.** The paper evaluates the performance and inference speed improvements achieved by the pro
- **W1.** Section 4's theoretical analysis discusses generalization performance using the loss $l^\rho$ (e.g., cross-entropy for classification), but the proposed method shown in Section 3 actually evaluates sharpness using the surrogate loss $l_\text{SRG}$, creating a gap between theory and practice. The paper should add theoretical supplements describing the effects of this gap. Furthermore, practical insights such as the correlation between sharpness computed via cross-entropy using ground
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
