Enhancing End-to-End Autonomous Driving with Latent World Model
Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, Tieniu Tan

TL;DR
This paper introduces LAW, a self-supervised latent world model that enhances scene feature learning for end-to-end autonomous driving, leading to improved trajectory prediction and state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel self-supervised learning approach using the LAtent World model (LAW) to improve scene feature representations in end-to-end driving systems.
Findings
LAW achieves state-of-the-art performance on nuScenes, NAVSIM, and CARLA benchmarks.
Self-supervised learning with LAW enhances trajectory prediction accuracy.
The approach is effective in both perception-free and perception-based frameworks.
Abstract
In autonomous driving, end-to-end planners directly utilize raw sensor data, enabling them to extract richer scene features and reduce information loss compared to traditional planners. This raises a crucial research question: how can we develop better scene feature representations to fully leverage sensor data in end-to-end driving? Self-supervised learning methods show great success in learning rich feature representations in NLP and computer vision. Inspired by this, we propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks, improving scene feature learning and optimizing trajectory prediction. LAW achieves state-of-the-art performance across…
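The prediction task described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes the world model is a single linear map, that the training target is the latent extracted from the actual next frame, and that the loss is a mean squared error. The names `predict_future_latent`, `mse`, `w_latent`, and `w_traj` are hypothetical and chosen only for illustration.

```python
import random


def predict_future_latent(latent, trajectory, w_latent, w_traj):
    """Toy linear world model (an assumption, not the paper's architecture).

    Predicts the next latent as W_latent @ latent + W_traj @ flattened waypoints.
    """
    flat_traj = [coord for waypoint in trajectory for coord in waypoint]
    predicted = []
    for row_l, row_t in zip(w_latent, w_traj):
        value = sum(a * b for a, b in zip(row_l, latent))
        value += sum(a * b for a, b in zip(row_t, flat_traj))
        predicted.append(value)
    return predicted


def mse(pred, target):
    """Mean squared error between predicted and observed future latents."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
latent_dim, num_waypoints = 4, 3

# Placeholder latents; in practice these would come from a scene encoder.
current_latent = [rng.uniform(-1, 1) for _ in range(latent_dim)]
future_latent = [rng.uniform(-1, 1) for _ in range(latent_dim)]

# Ego trajectory as (x, y) waypoints, here a straight line ahead.
ego_trajectory = [(0.5 * (i + 1), 0.0) for i in range(num_waypoints)]

w_latent = random_matrix(latent_dim, latent_dim, rng)
w_traj = random_matrix(latent_dim, 2 * num_waypoints, rng)

predicted = predict_future_latent(current_latent, ego_trajectory, w_latent, w_traj)
loss = mse(predicted, future_latent)  # no human labels are needed for this loss
print(f"self-supervised loss: {loss:.4f}")
```

Because the target comes from the next frame itself, the loss requires no annotation, which is what makes the task self-supervised.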
Peer Reviews
Decision: ICLR 2025 Poster
1. Novel integration of world model concepts into end-to-end driving.
2. Comprehensive experimental validation across multiple benchmarks. Demonstrates practical improvements in both closed-loop and open-loop settings.
1. Limited discussion of computational overhead: no analysis of inference time or model size. Autonomous driving systems must make decisions in real time, typically requiring processing speeds of at least 10-20 Hz (decisions every 50-100 ms). Without inference time analysis, it's unclear if LAW is applicable for real deployment on edge computing devices.
2. No discussion of robustness to adverse weather/lighting conditions. As in Appendix A.1, the augmentation is claimed to enhance the robustnes
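The rate-to-latency figures in the first point follow from the relation period = 1000 / rate. The sketch below works through that arithmetic; the helper names `frame_budget_ms` and `meets_budget` are hypothetical, and the 80 ms latency is an illustrative number, not a measurement of LAW.

```python
def frame_budget_ms(rate_hz: float) -> float:
    """Time available per decision, in milliseconds, at a given decision rate."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def meets_budget(inference_ms: float, rate_hz: float) -> bool:
    """True if one forward pass fits within the per-decision time budget."""
    return inference_ms <= frame_budget_ms(rate_hz)


hypothetical_latency_ms = 80.0  # illustrative value only

for rate in (10, 20):
    budget = frame_budget_ms(rate)
    ok = meets_budget(hypothetical_latency_ms, rate)
    print(f"{rate} Hz -> {budget:.0f} ms budget, fits: {ok}")
```

A model with this hypothetical latency would fit the 10 Hz budget but not the 20 Hz one, which is why the reviewer asks for measured inference times.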
The proposed LAW framework uses a self-supervised method that significantly reduces the need for heavy annotation, addressing the data scalability challenge that many existing methods face. The detailed breakdowns of ablation studies, latency analyses, and visualizations give readers clear and comprehensive information to understand and reproduce the work.
The view selection strategy is a valuable insight to improve the efficiency of the method, but it adds complexity to the overall framework. Although there is only a minimal performance drop, it seems the view selection strategy hasn’t fully captured the informative scenes in driving scenarios. If there could be more discussion or analysis on what caused the performance drop, or how this issue could be mitigated with the Latent World Model, it would make the work more complete.
1. Introduction of the Latent World Model (LAW) to predict future scene latents from current scene latents and ego trajectories.
2. Demonstrated universality across various common autonomous driving paradigms, i.e., perception-free and perception-based frameworks.
3. Extensive experiments conducted on multiple benchmarks, achieving state-of-the-art performance on real-world open-loop datasets like nuScenes and simulator-based closed-loop CARLA benchmark.
See the Questions section.
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications
