From Tokens to Photons: Test-Time Physical Prompting for Vision-Language Models
Boyeong Im, Wooseok Lee, Yoojin Kwon, and Hyung-Sin Kim

TL;DR
This paper introduces MVP, a test-time physical prompting framework that uses camera settings as physical prompts to improve vision-language model robustness in real environments, outperforming digital-only methods.
Contribution
MVP is a novel framework that leverages physical camera settings as prompts during test-time adaptation, moving beyond digital prompts for enhanced robustness.
Findings
MVP outperforms digital-only TTA by up to 25.6 percentage points on ImageNet-ES and ImageNet-ES-Diverse.
MVP achieves up to 3.4 percentage points additional gains over sensor control plus TTA pipelines.
MVP remains effective with reduced parameter sets, lowering capture latency.
Abstract
To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle (ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only…
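The selection-then-vote steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the per-view class probabilities and the source-affinity scores are already computed, since the excerpt does not define how the affinity score is obtained. It also assumes the entropy filter is applied to the pooled augmented views across all retained settings. The function name `mvp_style_vote` and the parameters `top_k` and `keep` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mvp_style_vote(view_probs, affinity, top_k=2, keep=3):
    """Forward-only selection-then-vote over physical views (illustrative only).

    view_probs[v][a] -- class probabilities for augmentation a of physical view v
    affinity[v]      -- assumed precomputed source-affinity score (higher is better)
    top_k, keep      -- hypothetical knobs: views retained, augmented views retained
    """
    # 1. Keep the top-k physical views (sensor settings) by affinity score.
    ranked = sorted(range(len(affinity)), key=lambda v: affinity[v], reverse=True)
    kept_views = ranked[:top_k]

    # 2. Pool the augmented predictions of the retained views.
    pooled = [probs for v in kept_views for probs in view_probs[v]]

    # 3. Keep only the lowest-entropy (most confident) augmented views.
    confident = sorted(pooled, key=entropy)[:keep]

    # 4. Hard voting: each kept view casts one vote for its argmax class.
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    return votes.most_common(1)[0][0]


# Toy data: three physical views, two augmentations each, three classes.
example_probs = [
    [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]],  # view 0: confident in class 0
    [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]],  # view 1: near-uniform, low affinity
    [[0.10, 0.85, 0.05], [0.20, 0.70, 0.10]],  # view 2: confident in class 1
]
example_affinity = [0.9, 0.2, 0.7]

print(mvp_style_vote(example_probs, example_affinity, top_k=2, keep=3))
```

Nothing in the sketch computes a gradient or touches model weights, which matches the abstract's description of the pipeline as forward-only. The only quantities consumed are model outputs and a per-view score.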
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
