Navigation-Guided Sparse Scene Representation for End-to-End Autonomous Driving
Peidong Li, Dixiao Cui

TL;DR
This paper introduces SSR, a navigation-guided sparse scene representation framework that enhances end-to-end autonomous driving by reducing reliance on expensive annotations and improving efficiency and performance.
Contribution
SSR is a framework that represents the scene with only 16 navigation-guided tokens, eliminating the need for supervised perception sub-tasks while improving planning accuracy and inference speed.
Findings
27.2% relative reduction in L2 error compared to UniAD on nuScenes
51.6% decrease in collision rate compared to UniAD on nuScenes
10.9x faster inference speed
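For readers who want to check how figures like these are derived, the sketch below shows the usual arithmetic for a relative reduction and a speedup factor. The function names and the input values are hypothetical placeholders chosen only to illustrate the formulas; they are not measurements from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline` (0.272 means 27.2%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


# Hypothetical placeholder inputs, NOT values reported in the paper.
example_baseline_l2 = 1.000
example_new_l2 = 0.728
print(f"relative L2 reduction: {relative_reduction(example_baseline_l2, example_new_l2):.1%}")
print(f"speedup: {speedup(10.9, 1.0):.1f}x")
```

A relative reduction is measured against the baseline value, so it should not be read as an absolute drop in metres or percentage points.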
Abstract
End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for human-designed supervised sub-tasks, allowing computational resources to concentrate on essential elements directly related to navigation intent. We further introduce a temporal enhancement module that aligns predicted future scenes with actual future scenes through self-supervision. SSR achieves a 27.2% relative reduction in L2 error and a 51.6% decrease in collision rate relative to UniAD on nuScenes,…
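The idea of compressing a dense scene into 16 navigation-guided tokens can be pictured with a minimal, dependency-free sketch. The abstract only states the token count and that the tokens are guided by navigation; everything else here is an assumption for illustration: the use of scaled dot-product cross-attention, the additive conditioning on a navigation embedding, and all names (`navigation_guided_tokens`, `scene_feats`, `nav_embedding`). It is not the authors' architecture.

```python
import math
import random

NUM_TOKENS = 16  # token count stated in the abstract


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def navigation_guided_tokens(scene_feats, queries, nav_embedding):
    """Pool N dense scene features into len(queries) sparse tokens.

    Assumption: each query is shifted by the navigation embedding and then
    attends over all scene features (scaled dot-product attention).
    """
    dim = len(nav_embedding)
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for query in queries:
        # Assumed conditioning: add the navigation embedding to the query.
        guided = [q + n for q, n in zip(query, nav_embedding)]
        weights = softmax([scale * dot(guided, feat) for feat in scene_feats])
        # Each token is an attention-weighted average of the scene features.
        token = [
            sum(w * feat[d] for w, feat in zip(weights, scene_feats))
            for d in range(dim)
        ]
        tokens.append(token)
    return tokens


# Toy inputs: 100 random scene features of dimension 8 (placeholders).
rng = random.Random(0)
DIM, NUM_FEATS = 8, 100
scene = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FEATS)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
nav = [rng.gauss(0, 1) for _ in range(DIM)]

tokens = navigation_guided_tokens(scene, queries, nav)
print(len(tokens), len(tokens[0]))  # 16 8
```

Whatever the real mechanism is, the point of the sketch is the shape change: a planner downstream would consume 16 vectors instead of the full set of scene features, which is where the efficiency claim comes from.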
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Human-Automation Interaction and Safety
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
