Test-Time Consistency in Vision Language Models
Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal

TL;DR
This paper introduces a simple, post-hoc test-time framework to improve the semantic consistency of vision-language models without retraining, significantly enhancing their reliability across equivalent inputs.
Contribution
The paper proposes a model-agnostic, test-time consistency method built on two complementary objectives. It applies to any VLM whose weights are accessible and improves semantic consistency without supervised re-training.
Findings
Significant improvements in consistency on the MM-R3 benchmark.
Applicable to various state-of-the-art VLMs without retraining.
Establishes a new inference-time adaptation approach for multimodal models.
Abstract
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often exhibit inconsistent behavior when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs, despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training. Our method is entirely post-hoc, model-agnostic, and applicable to any VLM with access to its weights. Given a single test point, we enforce consistent predictions via two complementary…
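To make the notion of "divergent predictions across semantically equivalent inputs" concrete, here is a minimal sketch of one way to score consistency: the fraction of answer pairs that agree when the same question is asked in several equivalent forms. This is an illustrative assumption, not the MM-R3 metric or the authors' method; the function name `pairwise_consistency` and the exact-match comparison after lowercasing are hypothetical choices.

```python
from itertools import combinations


def pairwise_consistency(predictions):
    """Fraction of prediction pairs that match exactly after normalization.

    Hypothetical helper for illustration only; real benchmarks may use
    semantic similarity rather than exact string match.
    """
    # Normalize case and surrounding whitespace so trivial differences
    # are not counted as disagreements.
    normalized = [p.strip().lower() for p in predictions]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        # Zero or one prediction is trivially consistent.
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Answers a model might give to three rephrasings of the same question.
answers = ["Two", "two", "three"]
score = pairwise_consistency(answers)
print(f"consistency = {score:.2f}")
```

A model can score high on average accuracy and still score low here, which is the gap the abstract describes.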
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
