WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang; Lingdong Kong; Tianyi Yan; Hongsi Liu; Wesley Yang; Ziqi Huang; Wei Yin; Jialong Zuo; Yixuan Hu; Dekai Zhu; Dongyue Lu; Youquan Liu; Guangfeng Jiang; Linfeng Li; Xiangtai Li; Long Zhuo; Lai Xing Ng; Benoit R. Cottereau; Changxin Gao; Liang Pan; Wei Tsang Ooi; Ziwei Liu

arXiv:2512.10958 · cs.CV · December 12, 2025

TL;DR

WorldLens introduces a comprehensive benchmark, dataset, and evaluation model to assess the realism, physics, and behavior of generative driving world models, addressing a key gap in embodied AI evaluation.

Contribution

It provides the first unified framework for evaluating geometric, physical, and behavioral fidelity of driving world models, including a large annotated dataset and an explainable scoring agent.

Findings

01 No existing model excels across all aspects of realism and behavior.

02 Models with strong textures often violate physics, while geometry-stable models lack behavioral fidelity.

03 The benchmark and dataset enable standardized, human-aligned evaluation of world models.

Abstract

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications