Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation
Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo

TL;DR
This paper introduces Token Condensation as Adaptation (TCA), a training-free method that improves vision-language model performance on unseen datasets by efficiently condensing tokens, achieving significant accuracy gains and computational savings.
Contribution
The paper proposes TCA, a novel training-free token condensation technique that enhances zero-shot transfer and robustness of vision-language models like CLIP.
Findings
Up to 21.4% performance improvement on cross-dataset benchmarks.
Reduces GFLOPs by 12.2% to 48.9%.
Minimal hyperparameter dependency.
Abstract
Contrastive Language-Image Pretraining (CLIP) excels at learning generalizable image representations but often falls short in zero-shot inference on certain downstream datasets. Test-time adaptation (TTA) mitigates this issue by adjusting components like normalization layers or context prompts, yet it typically requires large batch sizes and extensive augmentations, leading to high computational costs. This raises a key question: Can VLMs' performance drop in specific test cases be mitigated through efficient, training-free approaches? To explore the solution, we investigate token condensation (TC) techniques, originally designed to enhance vision transformer efficiency by refining token usage during inference. We observe that informative tokens improve visual-text alignment in VLMs like CLIP on unseen datasets. However, existing TC methods often fail to maintain in-distribution…
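The general idea behind token condensation, keeping informative tokens and compressing the rest, can be illustrated with a minimal sketch. This is not the TCA procedure from the paper, whose details are not given in the text above. It is a generic prune-and-merge step under stated assumptions: each patch token already has an importance score (for instance, attention received from the class token), the top-scoring tokens are kept, and the remainder are averaged into a single token. The function name `condense_tokens` and its parameters are hypothetical.

```python
from typing import List, Sequence


def condense_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[List[float]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into one.

    tokens: patch-token embeddings, one vector per token (class token excluded).
    scores: one importance score per token (assumed given, e.g. class-token
            attention; how the paper scores tokens is not specified here).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Rank token indices by descending score (ties keep their original order).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])  # restore the original token order
    dropped_idx = ranked[keep:]

    condensed = [list(tokens[i]) for i in kept_idx]

    if dropped_idx:
        # Merge all low-score tokens into a single averaged token so their
        # information is compressed rather than discarded outright.
        dim = len(tokens[0])
        merged = [
            sum(tokens[i][d] for i in dropped_idx) / len(dropped_idx)
            for d in range(dim)
        ]
        condensed.append(merged)

    return condensed


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    toy_scores = [0.1, 0.5, 0.3, 0.1]
    print(condense_tokens(toy_tokens, toy_scores, keep=2))
```

With the toy input, four tokens are reduced to three: the two highest-scoring tokens plus one merged token. Since self-attention cost grows with the square of the sequence length, shortening the sequence in this way is where the compute savings of token condensation methods come from.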
Taxonomy
Topics: Computer Graphics and Visualization Techniques
Methods: Contrastive Language-Image Pre-training · ALIGN · Pruning
