Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Lixian Chen, Yanhui Chen, Junyi Lin

TL;DR
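This paper introduces MG-MTTA, a test-time adaptation method for vision-language models that keeps the backbone frozen and updates only a lightweight gate or adapter, weighting each modality by its estimated reliability so that entropy minimization does not amplify a shifted modality.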
Contribution
It takes a majorization view of multimodal posteriors and casts adaptation as a constrained de-mixing problem on the fused prediction, addressing asymmetric modality shift without altering the backbone.
Findings
MG-MTTA improves top-1 accuracy on ImageNet-based benchmarks under semantic and joint shifts.
The method maintains competitive performance in visual-only settings.
The analysis gives conditions under which entropy reduction preserves the correct ranking.
Abstract
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance…
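The failure mode described in the abstract can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration, not the paper's formulation: the fused posterior is modeled as a convex combination `g * p_v + (1 - g) * p_t` with a scalar gate `g`, the gate prior is a hypothetical quadratic penalty `lam * (g - g_prior)**2`, the gate is chosen by grid search rather than gradient updates, and the posteriors are made-up numbers. The sketch also includes a standard majorization check, since Shannon entropy is Schur-concave: a posterior that majorizes another has entropy no larger.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def majorizes(p, q, tol=1e-12):
    """True if p majorizes q: every partial sum of p sorted in
    descending order is at least the matching partial sum of q."""
    cp = cq = 0.0
    for a, b in zip(sorted(p, reverse=True), sorted(q, reverse=True)):
        cp += a
        cq += b
        if cp < cq - tol:
            return False
    return True

def fuse(p_v, p_t, g):
    """Assumed fusion rule: convex mixture with gate g on the visual branch."""
    return [g * a + (1.0 - g) * b for a, b in zip(p_v, p_t)]

def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

def best_gate(p_v, p_t, g_prior=None, lam=0.0, steps=101):
    """Grid search over g in [0, 1] for fused entropy plus an optional
    (hypothetical) quadratic gate prior."""
    best_g, best_loss = None, float("inf")
    for i in range(steps):
        g = i / (steps - 1)
        loss = entropy(fuse(p_v, p_t, g))
        if g_prior is not None:
            loss += lam * (g - g_prior) ** 2
        if loss < best_loss:
            best_g, best_loss = g, loss
    return best_g

# Made-up posteriors over 3 classes; the true class is 0.
p_v = [0.70, 0.20, 0.10]  # visual branch: reliable, moderately confident
p_t = [0.05, 0.90, 0.05]  # text branch: shifted, confidently wrong

# Plain entropy minimization: fused entropy is concave in g, so the
# minimum sits at an endpoint, here the sharper but wrong text branch.
g_plain = best_gate(p_v, p_t)
fused_plain = fuse(p_v, p_t, g_plain)

# With a prior that trusts the visual branch (g_prior and lam are assumed).
g_reg = best_gate(p_v, p_t, g_prior=0.9, lam=5.0)
fused_reg = fuse(p_v, p_t, g_reg)

print("plain gate:", g_plain, "predicted class:", argmax(fused_plain))
print("prior gate:", g_reg, "predicted class:", argmax(fused_reg))
print("p_t majorizes p_v:", majorizes(p_t, p_v))
```

In this toy setting the unregularized gate collapses onto the text branch because that branch is more peaked, so the fused posterior becomes sharper while the prediction becomes wrong. The assumed prior keeps the gate near the visual branch and the prediction stays on class 0. How MG-MTTA actually builds its reliability-aware prior from anchor-based consistency and cross-modal conflict is not specified in the text above.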
