LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model
Xiaodong Wang, Zhirong Wu, Peixi Peng

TL;DR
This paper introduces LongDWM, a hierarchical and distillation-based approach to improve long-term driving world models, achieving more coherent and efficient long video generation in autonomous driving scenarios.
Contribution
The paper proposes a novel hierarchical decoupling and self-supervised distillation method to enhance long-term video prediction in driving models, addressing the training-inference gap.
Findings
27% improvement in FVD on the NuScenes benchmark
85% reduction in inference time for long video generation
Enhanced coherence in infinite driving scene videos
Abstract
Driving world models simulate the future by generating video conditioned on the current state and actions. However, current models often suffer from serious error accumulation when predicting the long-term future, which limits their practical application. Recent studies use the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving…
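The roll-out problem the abstract describes can be illustrated with a toy sketch. This is not the paper's model: `predict_clip`, `rollout`, and `drift` are hypothetical names, each "frame" is a single scalar position, and the per-frame `bias` is an assumed stand-in for the prediction error of a short-clip model. The sketch only shows how chaining short clips, each conditioned on the last generated frame, lets a small per-frame error grow with the number of roll-out rounds.

```python
def predict_clip(last_frame, clip_len, step=1.0, bias=0.01):
    """Hypothetical short-clip predictor (toy stand-in for a video model).

    Each "frame" is a scalar position. The true motion is `step` per frame;
    the toy model overshoots by an assumed `bias` on every frame.
    """
    frames = []
    current = last_frame
    for _ in range(clip_len):
        current = current + step + bias
        frames.append(current)
    return frames


def rollout(first_frame, clip_len, num_rounds):
    """Chain short clips: each round is conditioned on the last generated frame."""
    video = [first_frame]
    for _ in range(num_rounds):
        video.extend(predict_clip(video[-1], clip_len))
    return video


def drift(video, first_frame, step=1.0):
    """Absolute error of the final frame against the bias-free trajectory."""
    true_last = first_frame + step * (len(video) - 1)
    return abs(video[-1] - true_last)


# One round (the regime the toy model is "trained" for) vs. ten chained rounds.
short_video = rollout(0.0, clip_len=8, num_rounds=1)
long_video = rollout(0.0, clip_len=8, num_rounds=10)
print(len(short_video), drift(short_video, 0.0))
print(len(long_video), drift(long_video, 0.0))
```

In this toy setting the drift of the final frame scales with the total number of generated frames, which is the kind of accumulation that a model trained only on short clips cannot correct during a long roll-out.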
Taxonomy
Topics: Process Optimization and Integration
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Diffusion
