TL;DR
DeepSight introduces a long-horizon world model that predicts future bird's-eye-view (BEV) features, together with an adaptive text reasoning mechanism, and achieves state-of-the-art results on the Bench2Drive benchmark.
Contribution
The paper proposes a latent state prediction approach for long-horizon world modeling and an adaptive text reasoning module tailored to autonomous driving scenarios.
Findings
Achieved state-of-the-art (SOTA) results on the Bench2Drive benchmark.
Demonstrated effective long-horizon world modeling in BEV space.
Improved driving performance in challenging long-tail scenarios through text reasoning that draws on social knowledge.
Abstract
End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We…
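To make "parallel prediction of latent features for consecutive future frames" concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the function predict_future_bev, the per-frame query embeddings, and the shared projection weight are all hypothetical names chosen for illustration. The only point it shows is that every future frame is produced in one pass from the current BEV features, rather than each frame being rolled out from the previous prediction.

```python
import numpy as np

def predict_future_bev(bev, frame_queries, weight):
    """Predict K future BEV latent maps in a single pass (illustrative only).

    bev:           (H, W, C) current BEV latent features
    frame_queries: (K, C) one hypothetical embedding per future frame
    weight:        (C, C) hypothetical shared projection
    returns:       (K, H, W, C) latent features for K future frames
    """
    # Broadcast the current BEV map against every future-frame query at once,
    # so no frame depends on the prediction made for an earlier frame.
    conditioned = bev[None, :, :, :] + frame_queries[:, None, None, :]
    return np.tanh(conditioned @ weight)

# Random stand-ins for real features and learned parameters (assumptions).
rng = np.random.default_rng(0)
H, W, C, K = 4, 4, 8, 3
bev = rng.standard_normal((H, W, C))
frame_queries = rng.standard_normal((K, C))
weight = rng.standard_normal((C, C)) / np.sqrt(C)

future = predict_future_bev(bev, frame_queries, weight)
print(future.shape)  # (3, 4, 4, 8)
```

In an autoregressive rollout, errors in frame t would feed into frame t+1. In this sketch each of the K outputs depends only on the current features and its own frame query.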
