Subspace Alignment for Vision-Language Model Test-time Adaptation

Zhichen Zeng; Wenxuan Bao; Xiao Lin; Ruizhong Qiu; Tianxin Wei; Xuying Ning; Yuchen Yan; Chen Luo; Monica Xiao Cheng; Jingrui He; Hanghang Tong

arXiv:2601.08139·cs.CV·January 14, 2026

TL;DR

SubTTA improves test-time adaptation of vision-language models by aligning the semantic subspaces of the visual and textual modalities. The alignment narrows the modality gap induced by distribution shifts and filters task-irrelevant visual noise, so zero-shot predictions become a more reliable guide for adaptation.

Contribution

The paper introduces SubTTA, a method that aligns the semantic subspaces of the visual and textual modalities and filters visual noise, improving test-time adaptation of VLMs under distribution shifts.

Findings

01  SubTTA achieves an average improvement of 2.24% over state-of-the-art TTA methods.

02  Aligning semantic subspaces reduces the modality gap and suppresses visual nuisance.

03  Extensive experiments validate SubTTA's effectiveness across benchmarks and architectures.

Abstract

Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA…
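
The abstract is truncated before it describes the actual procedure, so the following is only a rough illustration of the general idea it names, not SubTTA itself. The sketch assumes that the "semantic subspace" can be approximated by the span of the class text embeddings, and that visual nuisance lies outside that span; it then projects visual embeddings onto the span before computing zero-shot predictions. All function names, the synthetic data, and the temperature value are hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_probs(visual, text, temperature=0.01):
    """Softmax over cosine similarities between image and class-text embeddings.

    The temperature is an assumed value, not one taken from the paper.
    """
    logits = l2_normalize(visual) @ l2_normalize(text).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def text_subspace_basis(text, rank):
    """Orthonormal basis (d x rank) for the row space of the text embeddings."""
    _, _, vt = np.linalg.svd(text, full_matrices=False)
    return vt[:rank].T

def project(x, basis):
    """Orthogonal projection of each row of x onto the span of the basis."""
    return x @ basis @ basis.T

# Synthetic stand-ins for encoder outputs (assumption: no real VLM is used here).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
text = rng.normal(size=(num_classes, dim))        # one embedding per class prompt
labels = np.array([0, 1, 2, 0])

basis = text_subspace_basis(text, rank=num_classes)

# Build nuisance that is orthogonal to the text subspace, then add it to the
# class signal so it dominates the raw visual embedding.
nuisance = rng.normal(size=(labels.size, dim))
nuisance = nuisance - project(nuisance, basis)
visual = text[labels] + 5.0 * nuisance

raw_probs = zero_shot_probs(visual, text)
filtered = project(visual, basis)                  # nuisance removed by projection
filtered_probs = zero_shot_probs(filtered, text)

print("raw confidence:     ", raw_probs.max(axis=1))
print("filtered confidence:", filtered_probs.max(axis=1))
```

In this toy setup the nuisance is orthogonal to the text span by construction, so the projection recovers the class signal exactly. Real visual noise will not be so cleanly separable, which is presumably part of what the full method has to handle.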

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis