DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Yang Zhou; Hao Shao; Letian Wang; Zhuofan Zong; Hongsheng Li; Steven L. Waslander

arXiv:2601.01528 · cs.CV · March 10, 2026

TL;DR

DrivingGen introduces a comprehensive benchmark for evaluating generative video world models in autonomous driving, addressing current limitations in metrics, datasets, and evaluation scope to advance safe and realistic simulation.

Contribution

It provides the first diverse dataset and suite of metrics for assessing generative driving world models, enabling more reliable and controllable autonomous driving simulations.

Findings

1. General models excel in visual quality but lack physical accuracy.
2. Driving-specific models better capture motion but have lower visual realism.
3. Benchmark reveals trade-offs between visual fidelity and physical plausibility.

Abstract

Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- DrivingGen is the first benchmark to jointly evaluate visual, kinematic, and interactive aspects of generative driving world models, addressing critical gaps in prior works.
- Introduces FTD for trajectory distribution, kinematic quality scores, and agent disappearance detection using VLMs, all tailored for driving safety and realism.
- The dataset includes under-represented conditions (e.g., night, snow, sandstorms) and global geographic variety, enabling more robust and realistic evaluation.
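The review above names FTD without defining it. As a rough illustration only, the sketch below assumes FTD is a Fréchet-style distance between Gaussians fitted to trajectory features, by analogy with FID and FVD; that reading is an assumption, not something this summary confirms. The function names and the hand-crafted features (`trajectory_features`) are hypothetical stand-ins and should not be read as the benchmark's actual feature extractor or implementation.

```python
import numpy as np


def trajectory_features(traj):
    """Hypothetical features for one (T, 2) xy trajectory.

    Returns mean step length, step-length std, and net displacement.
    These are illustrative only, not the benchmark's features.
    """
    steps = np.diff(traj, axis=0)
    step_len = np.linalg.norm(steps, axis=1)
    net = np.linalg.norm(traj[-1] - traj[0])
    return np.array([step_len.mean(), step_len.std(), net])


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def trajectory_set_distance(real_trajs, generated_trajs):
    """Compare two lists of (T, 2) trajectories via their feature distributions."""
    real = np.stack([trajectory_features(t) for t in real_trajs])
    gen = np.stack([trajectory_features(t) for t in generated_trajs])
    return frechet_distance(real, gen)
```

Under this assumption, a lower value means the generated trajectories are distributed more like the real ones in feature space; the score says nothing about any single clip.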

Weaknesses

- With only 400 clips, the dataset may not fully represent the long-tail of real-world driving scenarios, despite its diversity.
- The benchmark focuses on open-loop video generation and does not assess models in interactive, closed-loop simulation settings.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The dataset distributions are balanced and diverse across all conditions.
2. DrivingGen has created a new, specialized set of various measurements to evaluate generated driving videos; these are designed for the complexities of driving and are therefore more effective than standard video evaluation tools.
3. The experiments are comprehensive.

Weaknesses

1. It would be more convincing to incorporate downstream task performance (detection, mapping, planning) into the evaluation system. However, this seems difficult because the dataset contains only front-view data.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

Establishing a more practical benchmark for driving video generation is valuable. The contributions of providing more diverse test data and comprehensive evaluation metrics are clear. To maximize its contribution, the benchmark and evaluation code should be made open-source.

Weaknesses

No major flaws were identified in the current work. However, the authors could further improve the benchmark by including evaluations for scene content controllability. While the paper addresses video quality, temporal consistency, and ego trajectory controllability, the controllability of generated scene contents (such as agents controlled with bounding boxes or roads via maps/lanes) is also important for autonomous driving applications. These control signals can be extracted from videos using d

Code & Models

Datasets

yangzhou99/DrivingGen
dataset · 4.3k dl

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications