PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma

TL;DR
PolarVLM introduces a multimodal vision-language framework that incorporates polarimetric physical parameters to address optical ambiguities like reflections and transparency, enhancing semantic understanding in complex scenes.
Contribution
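PolarVLM is presented as the first framework to integrate polarimetric physical parameters into VLMs. It combines a dual-stream architecture with a progressive two-stage training strategy and is accompanied by PolarVQA, a new benchmark for polarization-aware VQA.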
Findings
PolarVLM outperforms the RGB baseline by 25.4% overall.
It achieves a 26.6% improvement in reflection recognition.
It attains a 34.0% improvement in glass counting.
Abstract
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs…
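For readers unfamiliar with the term, the sketch below illustrates what "polarimetric physical parameters" commonly refers to in polarization imaging: the linear Stokes components, the degree of linear polarization (DoLP), and the angle of linear polarization (AoLP), computed from intensities captured behind polarizers at 0°, 45°, 90°, and 135°. The abstract does not say which parameters PolarVLM uses or how it computes them, so this is a standard textbook formulation offered as an assumption, and the function name `polar_params` is hypothetical.

```python
import math

def polar_params(i0, i45, i90, i135):
    """Return (S0, DoLP, AoLP) for one pixel.

    Inputs are intensities measured behind linear polarizers at
    0, 45, 90, and 135 degrees. This is the standard linear Stokes
    formulation, not necessarily the one used by PolarVLM.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 vs. -45 degree component

    # DoLP in [0, 1]: fraction of the light that is linearly polarized.
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0

    # AoLP in radians, in (-pi/2, pi/2]; some conventions wrap it to [0, pi).
    aolp = 0.5 * math.atan2(s2, s1)
    return s0, dolp, aolp

# Fully horizontally polarized light versus unpolarized light.
print(polar_params(1.0, 0.5, 0.0, 0.5))
print(polar_params(0.5, 0.5, 0.5, 0.5))
```

Specular reflections and glass surfaces tend to produce partially polarized light, so quantities like DoLP can separate regions that look alike in RGB. That is the kind of cue the abstract describes as resolving optical ambiguities.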
