DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang

arXiv:2605.10564 · cs.CV · May 12, 2026

TL;DR

DeepSight introduces a long-horizon world model that predicts latent BEV features for consecutive future frames, together with an adaptive text reasoning mechanism, and reports state-of-the-art results on the Bench2Drive benchmark.

Contribution

The paper proposes a latent state prediction approach for long-horizon world modeling in BEV space and an efficient, adaptive text reasoning module tailored to autonomous driving scenarios.

Findings

01 Achieved state-of-the-art results on the Bench2Drive benchmark.

02 Demonstrated effective long-horizon world modeling in BEV space.

03 Improved driving performance in challenging long-tail scenarios through reasoning that draws on social knowledge.

Abstract

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We…
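
The abstract's idea of predicting latent BEV features for several future frames in parallel, rather than rolling them out one step at a time, can be illustrated with a toy sketch. This is not the authors' architecture: the function `predict_future_bev`, the per-step `horizon_embed` vectors, and the weights `w_mix` and `w_out` are all hypothetical names chosen for illustration, and a real model would use learned networks rather than random matrices.

```python
import numpy as np

def predict_future_bev(bev_now, horizon_embed, w_mix, w_out):
    """Toy parallel prediction of K future BEV latent maps (illustrative only).

    bev_now:       (C, H, W) current BEV latent features
    horizon_embed: (K, C) one hypothetical embedding per future step
    w_mix, w_out:  (C, C) hypothetical projection weights shared by all steps
    returns:       (K, C, H, W) predicted latents, one per future frame
    """
    c, h, w = bev_now.shape
    cells = bev_now.reshape(c, h * w).T  # (H*W, C): one feature vector per BEV cell
    # Broadcast the current state against every horizon embedding so all
    # K future steps are conditioned in one pass, with no step-by-step rollout.
    conditioned = cells[None, :, :] + horizon_embed[:, None, :]  # (K, H*W, C)
    hidden = np.tanh(conditioned @ w_mix)                        # (K, H*W, C)
    future = hidden @ w_out                                      # (K, H*W, C)
    # Restore the channel-first spatial layout for each predicted frame.
    return future.transpose(0, 2, 1).reshape(-1, c, h, w)

# Usage with random stand-ins for learned parameters (assumed shapes).
rng = np.random.default_rng(0)
bev = rng.standard_normal((8, 4, 4))         # C=8 channels on a 4x4 BEV grid
steps = rng.standard_normal((6, 8))          # K=6 future frames
w_mix = rng.standard_normal((8, 8)) * 0.1
w_out = rng.standard_normal((8, 8)) * 0.1
future_bev = predict_future_bev(bev, steps, w_mix, w_out)
print(future_bev.shape)
```

Because every future frame is computed from the current state plus its own step embedding, the cost of the sketch does not depend on chaining predictions, which is the property the phrase "parallel prediction" points to.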

Code & Models

Repositories

hotdogcheesewhite/DeepSight (GitHub)