TL;DR
This paper proposes a method for few-shot test-time domain adaptation that learns directly on the input space and increases text-feature diversity, significantly improving CLIP's performance on real-world benchmarks.
Contribution
It introduces a new approach that complements CLIP's frozen features with input space learning and dataset-specific text refinement for better domain adaptation.
Findings
Outperforms state-of-the-art on 5 large-scale benchmarks
Significantly improves performance of smaller networks like ViT-B/16
Achieves +5.1 F1 score on iWildCam and +3.1% WC accuracy on FMoW
Abstract
Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP's prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from the state-of-the-art of inheriting the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to…
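The idea of a trainable side branch running in parallel with a frozen backbone can be illustrated with a toy sketch. This is not the paper's implementation: the abstract is cut off before it states what the branch is enforced to learn, so the linear branch, the additive fusion, the squared-error objective, and every name below (`frozen_backbone`, `SideBranch`, `fused_features`) are assumptions made purely for illustration.

```python
import random


def frozen_backbone(x):
    """Stand-in for a frozen encoder: a fixed map that is never updated."""
    return [0.5 * x[0] + 0.1 * x[1], -0.2 * x[0] + 0.7 * x[1]]


class SideBranch:
    """Hypothetical trainable linear branch that reads the raw input."""

    def __init__(self, dim_in, dim_out, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
                  for _ in range(dim_out)]

    def forward(self, x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in self.w]

    def sgd_step(self, x, grad_out, lr):
        # For a linear map, d(out_i)/d(w_ij) = x_j.
        for i, row in enumerate(self.w):
            for j in range(len(row)):
                row[j] -= lr * grad_out[i] * x[j]


def fused_features(x, branch):
    # Assumed fusion rule: add the branch output to the frozen features.
    frozen = frozen_backbone(x)
    side = branch.forward(x)
    return [f + s for f, s in zip(frozen, side)]


def squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))


x = [1.0, 2.0]        # toy "input space" sample
target = [1.0, 1.0]   # toy target for the fused features
branch = SideBranch(dim_in=2, dim_out=2)

loss_before = squared_error(fused_features(x, branch), target)
for _ in range(50):
    out = fused_features(x, branch)
    grad = [2.0 * (o - t) for o, t in zip(out, target)]
    branch.sgd_step(x, grad, lr=0.05)  # only the side branch is updated
loss_after = squared_error(fused_features(x, branch), target)

print(loss_before, loss_after)
```

Only `SideBranch` holds parameters that change during the loop; `frozen_backbone` is a pure function, which mirrors the frozen-CLIP setting in spirit only.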
Peer Reviews
Decision: ICLR 2025 Poster
1. The overall writing is good and clear. 2. Promising results are obtained in the studied benchmarks. 3. Comprehensive evaluations are conducted, and the effectiveness of the proposed modules is also shown.
Novelty is somewhat limited. I appreciate that the proposed method clearly improves on the base CDPG and that it (as in Fig. 2) includes many submodules. However, except for the greedy text ensemble strategy, the techniques used in the submodules, e.g., the domain prompt and cross-attention-based fusion, do not look new.
1. The research task Few-shot Test-Time Domain Adaptation seems practical and meaningful. 2. With less robust backbones like ViT-B/16, the proposed method L2C shows significant performance improvement.
I have some questions about this paper that need further discussion. Please see them below. If the authors can address my concerns, I am willing to raise my score.
1. The experimental results show the improvement over state-of-the-art methods. 2. This paper is well written and follows a good structure. 3. The supplementary material is extensive, offering a lot of supplements and support to the main text.
1. The motivation behind the proposed method is not clearly explained. As shown in Figure 1, the method is organized in two branches: fusing the domain prompt and complementing dataset-specific knowledge. However, these two branches seem to have very similar functions, i.e., incorporating domain-specific knowledge, which makes the whole pipeline seem redundant. 2. As shown in the experiments, CPNet seems to play an important role. However, it is not clear whether the improvement is due to
