RLM: A Vision-Language Model Approach for Radar Scene Understanding
Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia

TL;DR
This paper introduces RadarVLM, a unified vision-language model for radar scene understanding that leverages structured spatial language supervision and a novel training objective to improve spatial reasoning and perception accuracy.
Contribution
The paper presents a new radar-language framework with a structured captioning method and a Spatially-Grounded CLIP objective, enabling fine-grained spatial reasoning in radar perception tasks.
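As an illustration of what a structured caption over vehicle distributions might look like, here is a minimal sketch. It is not the authors' captioning method: the range bands, azimuth sectors, sign convention (negative azimuth meaning left), caption wording, and the names `RANGE_BANDS`, `AZIMUTH_SECTORS`, and `structured_caption` are all hypothetical choices made for this example. It assumes each vehicle is given as a (range in metres, azimuth in degrees) pair, i.e. in polar coordinates relative to the radar.

```python
from collections import Counter

# Hypothetical bins; the paper's actual discretization is not given in this summary.
RANGE_BANDS = [(0.0, 20.0, "near"), (20.0, 50.0, "mid"), (50.0, 100.0, "far")]
AZIMUTH_SECTORS = [(-60.0, -15.0, "left"), (-15.0, 15.0, "ahead"), (15.0, 60.0, "right")]


def _label(value, bands):
    """Return the name of the half-open band [lo, hi) containing value, or None."""
    for lo, hi, name in bands:
        if lo <= value < hi:
            return name
    return None


def structured_caption(vehicles):
    """Summarize (range_m, azimuth_deg) detections as counts per range/azimuth cell."""
    counts = Counter()
    for range_m, azimuth_deg in vehicles:
        band = _label(range_m, RANGE_BANDS)
        sector = _label(azimuth_deg, AZIMUTH_SECTORS)
        if band is not None and sector is not None:  # skip detections outside all bins
            counts[(band, sector)] += 1

    # Emit cells in a fixed order so equal scenes always produce equal captions.
    parts = []
    for _, _, band in RANGE_BANDS:
        for _, _, sector in AZIMUTH_SECTORS:
            n = counts[(band, sector)]
            if n:
                noun = "vehicle" if n == 1 else "vehicles"
                parts.append(f"{n} {noun} {band} {sector}")
    return "; ".join(parts) if parts else "no vehicles in view"


print(structured_caption([(10.0, 0.0), (12.0, 5.0), (70.0, -30.0)]))
```

A fixed cell ordering keeps the caption deterministic, which matters if captions are later compared to score how similar two scenes are.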
Findings
SG-CLIP improves F1-score by up to 50% over vanilla CLIP.
Proposed metrics effectively evaluate spatial accuracy beyond linguistic similarity.
RadarVLM enhances segmentation performance with a 21% AP gain.
Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework encoding vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose…
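To make the idea of replacing binary matching with continuous scene similarity concrete, here is a minimal NumPy sketch of a CLIP-style loss with soft targets. It is not the paper's SG-CLIP formulation, which this summary does not spell out. It assumes a precomputed pairwise scene-similarity matrix `scene_sim` and turns it into target distributions with a softmax; the function names, `target_temperature`, and the default temperatures are hypothetical. Passing an identity matrix as the targets recovers the usual one-hot (binary matching) CLIP loss.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_targets(scene_sim, target_temperature=0.1):
    """Turn a pairwise scene-similarity matrix into row-stochastic targets (assumed form)."""
    return np.exp(log_softmax(scene_sim / target_temperature, axis=1))


def cross_entropy(logits, targets):
    """Mean cross-entropy between row-wise target distributions and softmax(logits)."""
    return float(-(targets * log_softmax(logits, axis=1)).sum(axis=1).mean())


def clip_style_loss(radar_emb, text_emb, targets, temperature=0.07):
    """Symmetric contrastive loss; `targets` is one-hot for vanilla CLIP, soft otherwise."""
    r = radar_emb / np.linalg.norm(radar_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # cosine similarities scaled by temperature

    # Re-normalize the transposed targets so the text-to-radar direction is also a distribution.
    targets_t = targets.T / targets.T.sum(axis=1, keepdims=True)
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets_t))


rng = np.random.default_rng(0)
radar = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
sim = np.eye(4)
sim[0, 1] = sim[1, 0] = 0.9  # samples 0 and 1 depict nearly identical scenes

vanilla = clip_style_loss(radar, text, np.eye(4))
soft = clip_style_loss(radar, text, soft_targets(sim))
print(vanilla, soft)
```

With one-hot targets, two near-identical scenes in a batch are pushed apart as hard as unrelated ones; soft targets spread some probability mass onto the similar pair instead.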
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced SAR Imaging Techniques
