Radar Spectra-Language Model for Automotive Scene Parsing
Mariia Pushkareva, Yuri Feldman, Csaba Domokos, Kilian Rambach, Dotan Di Castro

TL;DR
This paper introduces a radar spectra-language model that enhances interpretability and scene understanding in autonomous driving by enabling free-text querying of radar spectra, improving tasks like scene retrieval, free space segmentation, and object detection.
Contribution
We develop a novel radar spectra-language model that leverages vision-language embeddings to interpret radar spectra and improve scene perception in autonomous driving.
Findings
Improved free space segmentation using radar spectra embeddings.
Enhanced object detection performance with spectra-based features.
Effective querying of radar spectra for scene elements using natural language.
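As a rough illustration of what querying radar spectra with natural language can look like, the sketch below ranks frames by cosine similarity between a text embedding and per-frame spectra embeddings. It assumes both embeddings already live in a shared space. The function names, frame identifiers, and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_frames(text_embedding, spectra_embeddings):
    """Return frame ids sorted from most to least similar to the query.

    spectra_embeddings maps a (hypothetical) frame id to the embedding a
    radar spectra encoder would produce for that frame.
    """
    scores = {
        frame_id: cosine_similarity(text_embedding, embedding)
        for frame_id, embedding in spectra_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional embeddings, made up for illustration only.
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a free-text query
frames = {
    "frame_a": [0.9, 0.1, 0.0],
    "frame_b": [0.0, 1.0, 0.0],
    "frame_c": [0.5, 0.5, 0.0],
}
ranking = rank_frames(query, frames)
print(ranking)
```

In a real system the query vector would come from a text encoder and the frame vectors from a trained spectra encoder; only the ranking step is shown here.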
Abstract
Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing…
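The idea of matching a radar encoder's output to an existing embedding space can be sketched as a simple alignment loss. The example below is a minimal sketch, assuming each radar frame is paired with a target embedding from a frozen pre-trained model and that the mismatch is measured as mean cosine distance. The exact loss used in the paper is not given in this summary, so the loss form, function names, and toy values here are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_matching_loss(radar_embeddings, target_embeddings):
    """Mean cosine distance (1 - cosine similarity) over paired frames.

    radar_embeddings: outputs of a hypothetical trainable spectra encoder.
    target_embeddings: embeddings of the paired frames from a frozen model.
    """
    if len(radar_embeddings) != len(target_embeddings):
        raise ValueError("embeddings must be paired one-to-one")
    distances = [
        1.0 - cosine_similarity(r, t)
        for r, t in zip(radar_embeddings, target_embeddings)
    ]
    return sum(distances) / len(distances)


# Toy 2-dimensional embeddings, made up for illustration only.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]      # same directions as the targets
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the targets

loss_aligned = embedding_matching_loss(aligned, targets)
loss_misaligned = embedding_matching_loss(misaligned, targets)
print(loss_aligned, loss_misaligned)
```

Because the loss depends only on direction, the aligned embeddings score zero even though their norms differ from the targets. Training would minimize this quantity with respect to the spectra encoder's parameters while the target model stays fixed.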
Peer Reviews
Decision·Submitted to ICLR 2024
This paper introduces text information into feature fusion to improve the interpretability of radar spectra.
1. The framework seems to be a simple combination of existing methods. I did not see a design specific to the radar spectra-language model. 2. The detection experiment does not compare against SOTA methods such as RODNet. 3. What is [20] in Table 3? 4. If the description contains information about multiple objects, how is the text aligned with the corresponding object?
Pre-training the radar spectra encoder to optimize similarity to a fine-tuned OpenCLIP is novel. It allows for pre-training without the need for an explicit radar spectra dataset.
There is no discussion of what is still hard to do or not reliable. An analysis that varies the difficulty of the input scenes would also help answer this question.
1. To the best of my knowledge, this is the first paper trying to build a radar spectra-language model. 2. The fine-tuned VLM for autonomous driving scenes works much better than the off-the-shelf CLIP. 3. The zero-shot retrieval ability of RSLM is impressive, especially for small objects such as pedestrians and cyclists.
1. The author seems to lack paper writing skills. All the figures are unaesthetic, low-resolution bitmaps, and some of them are not necessary. For Figure 4a, it would be better to use a mathematical formulation instead of Python code to describe the loss functions. For Figure 4b, such a simple architecture could be moved to the supplementary material. 2. Changing the position encoding without fine-tuning may cause a performance drop, and splitting the image may break up objects at the edges. A better and more comm
Taxonomy
Topics: Speech Recognition and Synthesis
