Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Kepeng Xu, Li Xu, Gang He, Wenxin Yu

TL;DR
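This paper introduces measurement-grounded vision-language learning and reports that feeding a vision-language model RAW-derived sensor measurements instead of post-ISP RGB images improves grounding and answer accuracy on low-light, HDR, visibility-sensitive, and hallucination-sensitive cases.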
Contribution
The authors propose PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation, a scheme for transferring supervision from RGB proxies to measurement-domain observations.
Findings
PRISM-VL-8B outperforms the RGB Qwen3-VL-8B baseline by +0.1074 BLEU and +0.1071 ROUGE-L, reaching 0.6120 BLEU and 0.4571 ROUGE-L.
Preserving evidence in the measurement domain reduces grounding errors in VLMs.
Results show improved reasoning in low-light, HDR, and visibility-sensitive scenarios.
Abstract
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46…
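The abstract's premise, that RGB rendering can clip or quantize sensor evidence before a model sees it, can be illustrated with a toy example. The sketch below is not the paper's pipeline and not part of PRISM-VL: `toy_isp`, its gain, gamma, and bit depth, and the sample values are all hypothetical, chosen only to show how distinct linear measurements can collapse to identical 8-bit codes.

```python
def toy_isp(linear, exposure_gain=4.0, gamma=2.2, bits=8):
    """Hypothetical ISP stand-in: gain, clip to [0, 1], gamma-encode, quantize.

    `linear` holds normalized linear sensor values. Real ISPs are far more
    complex; this only models clipping and quantization.
    """
    levels = (1 << bits) - 1
    rendered = []
    for value in linear:
        scaled = min(max(value * exposure_gain, 0.0), 1.0)  # highlight clipping
        encoded = scaled ** (1.0 / gamma)                   # display gamma
        rendered.append(round(encoded * levels))            # 8-bit quantization
    return rendered


# Assumed sample measurements: three distinct highlights, three distinct deep shadows.
highlights = [0.30, 0.50, 0.90]
shadows = [1e-8, 2e-8, 3e-8]

rgb_highlights = toy_isp(highlights)
rgb_shadows = toy_isp(shadows)

print("distinct highlight measurements:", len(set(highlights)))
print("distinct highlight RGB codes:   ", len(set(rgb_highlights)))
print("distinct shadow measurements:   ", len(set(shadows)))
print("distinct shadow RGB codes:      ", len(set(rgb_shadows)))
```

In this toy setting the three highlight values all clip to the maximum code and the three shadow values all quantize to zero, so differences present in the measurement are no longer recoverable from the rendered output.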
