DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving
Haisheng Su, Wei Wu, Feixiang Song, Junjie Zhang, Zhenjie Yang, Junchi Yan

TL;DR
DriveMamba introduces a task-centric, scalable end-to-end autonomous driving framework that efficiently models long-term temporal dependencies and task relations, surpassing existing methods in accuracy and efficiency.
Contribution
It proposes a novel unified decoder with linear complexity that integrates dynamic task relation modeling, implicit view correspondence, and long-term temporal fusion for autonomous driving.
Findings
Outperforms existing methods on nuScenes and Bench2Drive datasets.
Demonstrates superior efficiency and generalizability.
Effectively models long-term temporal dependencies.
Abstract
Recent advances towards End-to-End Autonomous Driving (E2E-AD) have often been devoted to integrating modular designs into a unified framework for joint optimization, e.g., UniAD, which follows a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and relies on dense BEV features to encode scene representations. However, such a manual ordering design can inevitably cause information loss and cumulative errors, and it lacks flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism also hinder the scalability and efficiency of E2E-AD systems in handling spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence…
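The contrast the abstract draws between quadratic-complexity attention and a linear-complexity state space decoder can be illustrated with a toy recurrence. The following is a minimal sketch, not the DriveMamba decoder: it assumes a scalar state and fixed parameters `a`, `b`, `c` (Mamba-style models make these input-dependent), and the function names `ssm_scan` and `pairwise_interactions` are hypothetical.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state space recurrence in a single pass.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Assumption: fixed scalar parameters, chosen only for illustration.
    The loop touches each token once, so cost grows linearly with length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # fold the new token into the running state
        ys.append(c * h)   # read the output from the state
    return ys


def pairwise_interactions(n: int) -> int:
    """Number of query-key scores full self-attention computes for n tokens."""
    return n * n


ys = ssm_scan([1.0, 0.0, 0.0, 2.0], a=0.5, b=1.0, c=1.0)
print(ys)                        # [1.0, 0.5, 0.25, 2.125]
print(pairwise_interactions(4))  # 16
```

The recurrence carries one running state across the sequence, whereas full self-attention scores every token pair. That difference is why sequence length (more cameras, more frames) is cheaper to scale under the first scheme.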
Peer Reviews
Decision: ICLR 2026 Poster
- The paper clearly articulates two significant problems in E2E-AD: 1) the limitations of sequential, manually ordered pipelines, such as information loss and error accumulation; 2) the efficiency and scalability constraints imposed by the quadratic complexity of attention in Transformer models.
- The idea to replace the Transformer decoder with a Mamba-based (SSM) decoder directly addresses the efficiency and scalability problem. The linear complexity of SSMs is a clear advantage for processi
I have the following concerns about this work:
- The "Trajectory-Centric Local2Global" scan (L232) creates a potential circular dependency. The scan order, which is an input to the decoder layers, is determined by an importance weight $w_i$ calculated from an intermediate predicted ego-trajectory $\psi^{\prime}$. This means the decoder's output (the trajectory) is required to define its input (the scan order). The paper does not specify how this intermediate trajectory is generated or analyze the
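The dependency this reviewer describes can be made concrete with a small sketch of a trajectory-guided token ordering. Everything specific here is an assumption for illustration: the excerpt does not give the formula for $w_i$, so the sketch uses `exp(-d_i)` with `d_i` the distance from a token's BEV position to the nearest waypoint of the intermediate trajectory, and it orders tokens from highest to lowest weight. The paper's weighting and ordering direction may differ. The names `importance_weights` and `local2global_order` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def importance_weights(tokens: List[Point], trajectory: List[Point]) -> List[float]:
    """Assumed weight: exp(-distance to the nearest trajectory waypoint)."""
    weights = []
    for tx, ty in tokens:
        d = min(math.hypot(tx - px, ty - py) for px, py in trajectory)
        weights.append(math.exp(-d))
    return weights


def local2global_order(tokens: List[Point], trajectory: List[Point]) -> List[int]:
    """Return token indices sorted from highest to lowest assumed weight."""
    w = importance_weights(tokens, trajectory)
    return sorted(range(len(tokens)), key=lambda i: -w[i])


# The intermediate trajectory is taken as given here, which is exactly the
# input the reviewer says the paper leaves unspecified.
trajectory = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
tokens = [(8.0, 2.0), (1.0, 5.0), (-3.0, 9.0), (0.0, 0.0)]
order = local2global_order(tokens, trajectory)
print(order)  # [3, 1, 2, 0]
```

Because `local2global_order` cannot run without `trajectory`, any error in that intermediate prediction changes the scan order seen by later layers, which is the sensitivity the review asks the authors to analyze.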
1. Unified Mamba Decoder: Achieves linear complexity while jointly processing perception, map, and planning queries, showing clear scalability advantages on high-resolution multi-camera inputs.
2. Hybrid Spatiotemporal Scan (HSS): Cleverly alternates between spatial and ego-centric scanning to balance locality preservation and long-range temporal consistency.
3. Task-centric tokenization: Structured query design (ego/map/agent) improves modular interpretability and relational learning.
4. 3D sen
1. Insufficient HSS details: The paper lacks explicit layer-wise configurations or stability studies when varying scan order; the contribution of each H/V-first and L2G layer is not isolated.
2. FPS reporting: Experimental FPS comparisons are unclear due to missing details on resolution, camera count, and hardware setup.
3. Depth branch robustness: No analysis of depth estimation noise, calibration error, or trade-offs between uniform-ray and learned-depth methods.
4. Trajectory prior ambiguity:
>S1. By replacing the quadratic-complexity Transformer with a Mamba-based decoder (SSM), the method effectively solves the major bottleneck of E2E-AD systems. This design drastically reduces memory consumption and makes the decoder easily scalable through simple layer stacking, which is a critical contribution to the exploration of scalable E2E-AD systems.
>S2. The ablation study rigorously confirms that simply stacking the decoder layers monotonically improves CIPO (Closest In-Path Objects) pe
>The work is well-executed, and I have only one significant concern regarding the robustness properties of the proposed architecture.
>W1. As shown in Table 10, the performance of the trajectory-guided scan appears to be highly dependent on the accuracy of the predicted trajectory. This suggests that the model might perform poorly and lack robustness in Out-of-Distribution (OOD) or extreme scenes with significant domain gaps, potentially causing planning failures. Given that robust operation in
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Multimodal Machine Learning Applications
