Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Kepeng Xu; Li Xu; Gang He; Wenxin Yu

arXiv:2605.11727·cs.AI·May 13, 2026


TL;DR

This paper introduces measurement-grounded vision-language learning and reports that reasoning over RAW-derived camera measurements, rather than post-ISP RGB images, improves grounding and answer accuracy on a held-out benchmark.

Contribution

The authors propose PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation to transfer supervision from RGB proxies to measurement-domain observations.

Findings

01

PRISM-VL-8B outperforms the RGB Qwen3-VL-8B baseline by +0.1074 BLEU and +0.1071 ROUGE-L.

02

Measurement-domain evidence preservation reduces grounding errors in VLMs.

03

The held-out benchmark targets low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, where the reported gains are measured.

Abstract

Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46…
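
The abstract's premise, that RGB rendering can clip sensor evidence before inference, can be illustrated with a small sketch. This is not the PRISM-VL pipeline: `render_8bit` is a hypothetical toy stand-in for an ISP (exposure gain, clipping, gamma, 8-bit quantization), and the sensor values and exposure settings are made up for illustration.

```python
def render_8bit(linear, exposure=1.0, gamma=2.2):
    """Toy ISP (an assumption, not the paper's): gain, clip to [0, 1], gamma, 8-bit."""
    scaled = min(max(linear * exposure, 0.0), 1.0)
    return round((scaled ** (1.0 / gamma)) * 255)

# Hypothetical linear sensor values in a bright region (1.0 = display white).
highlights = [1.2, 2.5, 4.0]

# At the default exposure, all three values clip to the same 8-bit code,
# so the rendered RGB image no longer distinguishes them.
rendered = [render_8bit(v) for v in highlights]
print(rendered)  # [255, 255, 255]

# A darker rendering of the same measurements keeps them apart,
# which is the kind of evidence an exposure bracket can expose.
underexposed = [render_8bit(v, exposure=0.25) for v in highlights]
print(underexposed)
```

The three measurements differ by more than a factor of three in the linear domain, yet a model that only sees the default rendering receives identical pixels for all of them.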
