Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
Zhangcheng Hou, Tomoaki Ohtsuki

TL;DR
This paper introduces Radar-Modulated Selection (RMS), a radar-camera depth estimation method that injects radar into Mamba's selective scan instead of fusing features outside the backbone, and reports state-of-the-art accuracy and efficiency.
Contribution
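The paper proposes RMS, in which radar modulates Mamba's selective scan from within: zero-initialised radar perturbations are added to the step size and readout, while the input projection and state dynamics stay image-only. This enables linear-cost cross-modal coupling and a better fallback to image-only processing.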
Findings
Achieves state-of-the-art depth estimation on nuScenes, with a 34% reduction in MAE.
RMS has the lowest single-frame latency, at 26.8 ms.
In-scan selection replaces out-of-scan fusion without loss of accuracy.
Abstract
Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods -- concatenation, confidence-aware gating, sparse supervision, graph-based extraction -- combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size and readout while leaving the input projection and state dynamics image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation,…
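The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function and parameter names (`selective_scan`, `W_dt_radar`, `W_C_radar`, and so on), the per-token alignment of radar features with image tokens, and the simplified discretisation are all assumptions made for illustration. It shows radar entering only the step size and the readout through zero-initialised weights, so that the output at initialisation matches the image-only scan, and the cost stays linear in sequence length.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def init_params(rng, D, N, R):
    """Hypothetical parameter layout: D channels, N state dims, R radar feature dims."""
    return {
        "A_log": np.log(np.tile(np.arange(1, N + 1, dtype=float), (D, 1))),  # (D, N)
        "W_B": 0.1 * rng.standard_normal((D, N)),    # input projection (image-only)
        "W_dt": 0.1 * rng.standard_normal((D, D)),   # step size from image tokens
        "W_C": 0.1 * rng.standard_normal((D, N)),    # readout from image tokens
        "W_dt_radar": np.zeros((R, D)),              # zero-initialised radar perturbation
        "W_C_radar": np.zeros((R, N)),               # zero-initialised radar perturbation
    }

def selective_scan(x, radar, p):
    """Sequential selective scan over L tokens.

    x:     (L, D) image tokens
    radar: (L, R) radar features, assumed already aligned to the tokens
    """
    L, D = x.shape
    A = -np.exp(p["A_log"])                  # (D, N) state dynamics, image-only
    N = A.shape[1]
    B = x @ p["W_B"]                         # (L, N) input projection, image-only
    # Radar perturbs the step size and the readout only.
    delta = softplus(x @ p["W_dt"] + radar @ p["W_dt_radar"])  # (L, D)
    C = x @ p["W_C"] + radar @ p["W_C_radar"]                  # (L, N)

    h = np.zeros((D, N))
    y = np.empty((L, D))
    for t in range(L):                       # one pass: cost linear in L
        dA = np.exp(delta[t][:, None] * A)                        # (D, N)
        dBx = delta[t][:, None] * B[t][None, :] * x[t][:, None]   # (D, N)
        h = dA * h + dBx
        y[t] = h @ C[t]                      # (D, N) @ (N,) -> (D,)
    return y

rng = np.random.default_rng(0)
L, D, N, R = 16, 8, 4, 3
params = init_params(rng, D, N, R)
x = rng.standard_normal((L, D))
radar = rng.standard_normal((L, R))

y_rms = selective_scan(x, radar, params)
y_image_only = selective_scan(x, np.zeros((L, R)), params)
same_at_init = np.allclose(y_rms, y_image_only)
```

With the radar weights at zero, `y_rms` and `y_image_only` coincide; once `W_dt_radar` or `W_C_radar` become non-zero during training, radar starts to change how long each token is integrated and how the state is read out, without touching `A` or `B`.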
