From Tokens to Photons: Test-Time Physical Prompting for Vision-Language Models

Boyeong Im, Wooseok Lee, Yoojin Kwon, and Hyung-Sin Kim

arXiv:2512.12571 · cs.CV · February 2, 2026

TL;DR

This paper introduces MVP, a test-time physical prompting framework that uses camera settings as physical prompts to improve vision-language model robustness in real environments, outperforming digital-only methods.

Contribution

MVP is a novel framework that leverages physical camera settings as prompts during test-time adaptation, moving beyond digital prompts for enhanced robustness.

Findings

01 MVP outperforms digital-only TTA by up to 25.6 percentage points on the ImageNet-ES and ImageNet-ES-Diverse benchmarks.

02 MVP gains up to a further 3.4 percentage points over pipelines that combine sensor control with TTA.

03 MVP remains effective with reduced camera-parameter sets, which lowers capture latency.

Abstract

To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle (ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, retains the lowest-entropy subset of augmented views, and aggregates predictions with zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only…
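
The selection-then-vote steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name select_then_vote, the keep_fraction parameter, and the array layout are hypothetical; the source-affinity scores are taken as a precomputed input because the abstract does not define them; and entropy filtering is applied to the pooled augmented views, which is one possible reading.

```python
import numpy as np


def softmax(logits, axis=-1):
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def entropy(probs, axis=-1):
    # Shannon entropy of each predictive distribution.
    return -(probs * np.log(probs + 1e-12)).sum(axis=axis)


def select_then_vote(logits, affinity, k, keep_fraction):
    """Hypothetical sketch of a selection-then-vote pipeline.

    logits:   array of shape (V, A, C) holding class logits for V physical
              views (camera settings), A digital augmentations, C classes.
    affinity: array of shape (V,), assumed precomputed source-affinity
              scores where higher means closer to the source domain.
    """
    num_classes = logits.shape[-1]

    # 1. Keep the top-k physical views by affinity score.
    top_views = np.argsort(affinity)[::-1][:k]

    # 2. Pool all augmented views of the retained settings (assumption).
    pooled = logits[top_views].reshape(-1, num_classes)
    probs = softmax(pooled)

    # 3. Retain the lowest-entropy (most confident) augmented views.
    ent = entropy(probs)
    n_keep = max(1, int(round(keep_fraction * len(ent))))
    confident = np.argsort(ent)[:n_keep]

    # 4. Hard voting: each retained view casts one vote for its argmax class.
    votes = probs[confident].argmax(axis=1)
    counts = np.bincount(votes, minlength=num_classes)
    return int(counts.argmax())
```

Every step uses only forward-pass outputs (logits), which matches the abstract's claim that no gradients or model modifications are needed. Hard voting is the limit of a temperature-scaled softmax as the temperature goes to zero, so each retained view contributes a one-hot vote rather than a soft probability.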

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications