RLM: A Vision-Language Model Approach for Radar Scene Understanding

Pushkal Mishra; Kshitiz Bansal; Dinesh Bharadia

arXiv:2511.21105·cs.CV·March 16, 2026

Open Access

TL;DR

This paper introduces RadarVLM, a unified vision-language model for radar scene understanding that leverages structured spatial language supervision and a novel training objective to improve spatial reasoning and perception accuracy.

Contribution

The paper presents a new radar-language framework with a structured captioning method and a Spatially-Grounded CLIP objective, enabling fine-grained spatial reasoning in radar perception tasks.
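To make the idea of a structured caption concrete, here is a minimal sketch of how vehicle detections might be turned into a fixed-grammar spatial description. Everything in it is an assumption for illustration: the bin edges, the zone names, the caption wording, the sign convention for azimuth, and the function names `bin_label` and `structured_caption` are hypothetical and are not taken from the paper. The only grounded idea is that the caption summarizes vehicle distributions in the radar's own coordinates, assumed here to be range and azimuth.

```python
from collections import Counter

# Hypothetical bins (assumption): ranges in metres, azimuth in degrees,
# with positive azimuth meaning "to the right of boresight".
RANGE_BINS = [(0, 20, "near"), (20, 50, "mid"), (50, 100, "far")]
AZIMUTH_BINS = [(-60, -15, "left"), (-15, 15, "center"), (15, 60, "right")]


def bin_label(value, bins):
    """Return the name of the half-open bin [lo, hi) containing value, else None."""
    for lo, hi, name in bins:
        if lo <= value < hi:
            return name
    return None


def structured_caption(detections):
    """Build a deterministic caption from (range_m, azimuth_deg) vehicle detections."""
    counts = Counter()
    for rng, az in detections:
        r, a = bin_label(rng, RANGE_BINS), bin_label(az, AZIMUTH_BINS)
        if r is not None and a is not None:  # drop detections outside the grid
            counts[(r, a)] += 1
    if not counts:
        return "no vehicles in view"
    parts = []
    # Iterate zones in a fixed order so equal scenes always yield equal captions.
    for _, _, r in RANGE_BINS:
        for _, _, a in AZIMUTH_BINS:
            n = counts.get((r, a), 0)
            if n:
                noun = "vehicle" if n == 1 else "vehicles"
                parts.append(f"{n} {noun} {r}-{a}")
    return "; ".join(parts)


print(structured_caption([(10, 0), (12, 5), (60, -30)]))
```

A fixed zone order and a fixed grammar mean that two scenes with the same vehicle layout produce identical text, which is the property that makes such captions usable as supervision rather than free-form description.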

Findings

01. SG-CLIP outperforms vanilla CLIP with up to 50% F1-score improvement.

02. Proposed metrics effectively evaluate spatial accuracy beyond linguistic similarity.

03. RadarVLM enhances segmentation performance with a 21% AP gain.

Abstract

Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework encoding vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose…
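The difference between binary matching and continuous scene similarity can be sketched as a contrastive loss with soft targets. This is not the paper's formulation, which the abstract does not spell out. It is one common way to soften CLIP targets, written in plain Python for readability. The `scene_sim` matrix (pairwise scene similarity in [0, 1]), the `target_temp` parameter, and all function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def soft_cross_entropy(logits, targets):
    """Mean over rows of -sum_j targets[i][j] * log_softmax(logits[i])[j]."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        logp = log_softmax(logit_row)
        total -= sum(t * lp for t, lp in zip(target_row, logp))
    return total / len(logits)


def vanilla_clip_loss(logits):
    """Standard symmetric CLIP loss: only the diagonal pair counts as a match."""
    n = len(logits)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return 0.5 * (soft_cross_entropy(logits, eye)
                  + soft_cross_entropy(transpose(logits), eye))


def soft_target_clip_loss(logits, scene_sim, target_temp=0.1):
    """Symmetric loss whose targets are a softmax over scene similarity (assumption)."""
    def to_targets(sim):
        return [softmax([s / target_temp for s in row]) for row in sim]
    radar_to_text = soft_cross_entropy(logits, to_targets(scene_sim))
    text_to_radar = soft_cross_entropy(transpose(logits),
                                       to_targets(transpose(scene_sim)))
    return 0.5 * (radar_to_text + text_to_radar)


# Two scenes that are nearly identical spatially (similarity 0.9).
logits = [[10.0, 0.0], [0.0, 10.0]]
scene_sim = [[1.0, 0.9], [0.9, 1.0]]
print(vanilla_clip_loss(logits), soft_target_clip_loss(logits, scene_sim))
```

In this toy case the vanilla loss is close to zero because the logits separate the two pairs perfectly, while the soft-target loss is much larger: it penalizes pushing apart two scenes that are almost the same. When `scene_sim` is the identity matrix and `target_temp` is small, the soft targets collapse to one-hot rows and the two losses coincide.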

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced SAR Imaging Techniques