PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Yuliang Li; Chu Zhou; Heng Guo; Boxin Shi; Imari Sato; Zhanyu Ma

arXiv:2605.07574 · cs.CV · May 12, 2026

TL;DR

PolarVLM introduces a multimodal vision-language framework that incorporates polarimetric physical parameters to address optical ambiguities like reflections and transparency, enhancing semantic understanding in complex scenes.

Contribution

PolarVLM is the first framework to integrate polarimetric physical parameters into VLMs, which it does with a dual-stream architecture. It is accompanied by PolarVQA, a new benchmark for polarization-aware VQA.

Findings

01. PolarVLM outperforms the RGB baseline by 25.4% overall.

02. It achieves a 26.6% improvement in reflection recognition.

03. It performs 34.0% better in glass counting.

Abstract

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs…
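
The abstract does not say which polarimetric physical parameters PolarVLM consumes. As an illustration only, the sketch below computes two quantities widely used in polarization imaging, the degree of linear polarization (DoLP) and the angle of linear polarization (AoLP), from intensities measured behind linear polarizers at 0°, 45°, 90°, and 135°. The function names and the per-pixel scalar interface are assumptions made for this example and are not taken from the paper.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities.

    Hypothetical helper for illustration; inputs are scalar intensities
    for one pixel measured at polarizer angles 0, 45, 90, and 135 degrees.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 vs. -45 degree component
    return s0, s1, s2


def dolp_aolp(s0, s1, s2, eps=1e-8):
    """Degree and angle of linear polarization from Stokes parameters.

    DoLP is 0 for unpolarized light and 1 for fully linearly polarized light.
    AoLP is returned in radians in the range (-pi/2, pi/2].
    """
    dolp = math.sqrt(s1 * s1 + s2 * s2) / max(s0, eps)  # eps guards dark pixels
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


if __name__ == "__main__":
    # Fully polarized light oriented at 45 degrees (Malus's law intensities).
    s0, s1, s2 = stokes_from_intensities(0.5, 1.0, 0.5, 0.0)
    print(dolp_aolp(s0, s1, s2))
```

Strong specular reflections from glass and similar surfaces tend to be partially polarized, so DoLP and AoLP carry cues that an RGB image alone does not. This is the kind of ambiguity the abstract describes, although the actual input representation used by PolarVLM may differ.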
