VLMaterial: Vision-Language Model-Based Camera-Radar Fusion for Physics-Grounded Material Identification
Jiangyou Zhu, He Chen

TL;DR
VLMaterial is a novel, training-free framework that fuses vision-language models with radar data for accurate, physics-grounded material identification in diverse real-world scenarios.
Contribution
VLMaterial introduces a dual-pipeline architecture and an adaptive fusion mechanism that interpret electromagnetic parameters and improve material recognition without any training.
Findings
Achieved 96.08% recognition accuracy in real-world tests.
Performed on par with state-of-the-art closed-set methods.
Validated across 120 experiments with diverse objects and environments.
Abstract
Accurate material recognition is a fundamental capability for intelligent perception systems to interact safely and effectively with the physical world. For instance, distinguishing visually similar objects like glass and plastic cups is critical for safety but challenging for vision-based methods due to specular reflections, transparency, and visual deception. While millimeter-wave (mmWave) radar offers robust material sensing regardless of lighting, existing camera-radar fusion methods are limited to closed-set categories and lack semantic interpretability. In this paper, we introduce VLMaterial, a training-free framework that fuses vision-language models (VLMs) with domain-specific radar knowledge for physics-grounded material identification. First, we propose a dual-pipeline architecture: an optical pipeline uses the Segment Anything Model and a VLM for material candidate proposals,…
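The abstract is cut off before it describes the radar pipeline and the fusion step, so the paper's actual mechanism is not shown here. As a rough illustration only, the general idea of combining optical material candidates with radar evidence can be sketched as a confidence-weighted average of two score sets. Everything in this sketch is an assumption made for illustration: the function names `normalize`, `fuse_candidates`, and `best_material`, the `radar_weight` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {label: value / total for label, value in scores.items()}


def fuse_candidates(
    optical: Dict[str, float],
    radar: Dict[str, float],
    radar_weight: float,
) -> Dict[str, float]:
    """Blend two candidate score sets with a weight in [0, 1].

    Hypothetical stand-in for an adaptive fusion step: radar_weight=0
    trusts only the optical candidates, radar_weight=1 only the radar.
    """
    if not 0.0 <= radar_weight <= 1.0:
        raise ValueError("radar_weight must be in [0, 1]")
    optical_n = normalize(optical)
    radar_n = normalize(radar)
    # Labels missing from one source contribute a score of 0 from it.
    labels = set(optical_n) | set(radar_n)
    fused = {
        label: (1.0 - radar_weight) * optical_n.get(label, 0.0)
        + radar_weight * radar_n.get(label, 0.0)
        for label in labels
    }
    return normalize(fused)


def best_material(fused: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Illustrative numbers only: the camera cannot separate the two
    # candidates, while the radar evidence favours glass.
    optical_scores = {"glass": 0.5, "plastic": 0.5}
    radar_scores = {"glass": 0.8, "plastic": 0.2}
    fused_scores = fuse_candidates(optical_scores, radar_scores, radar_weight=0.75)
    print(best_material(fused_scores), fused_scores)
```

In this toy setup the weight is fixed by the caller. An adaptive scheme would presumably derive it from signal quality or candidate ambiguity, which is a detail the truncated abstract does not specify.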
