LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Wei Luo; Yiting Lu; Xin Li; Haoran Li; Fengbin Guan; Chen Gao; Xin Jin; Yong Li; Zhibo Chen; Sijing Wu; Kang Fu; Yunhao Li; Ziang Xiao; Huiyu Duan; Jing Liu; Qiang Hu; Xiongkuo Min; Guangtao Zhai; Manxi Sun; Zixuan Guo; Yun Li; Ziyang Chen; Manabu Tsukada; Zhengyang Li; Zhenglin Du; Yi Wen; Licheng Jiao; Fang Liu; Lingling Li; Yiwen Ren; Zhilong Song; Dubing Chen; Yucheng Zhou; Tianyi Yan; Huan Zheng

arXiv:2605.05187 · cs.CV · May 7, 2026


TL;DR

The LoViF 2026 PhyScore challenge evaluates holistic quality of generated videos across multiple dimensions, emphasizing physical plausibility, temporal coherence, and anomaly detection in diverse scenarios.

Contribution

First comprehensive benchmark and challenge for assessing multi-dimensional quality and physical realism in 4D world model-generated videos.

Findings

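01 Participants developed metrics that jointly predict four quality dimensions: Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency.

02 The benchmark dataset includes 1,554 videos generated by seven world generative models, covering physics-relevant scenarios in three tracks and 26 categories.

03 Evaluation combines score prediction with localization of physical anomaly timestamps.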

Abstract

This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions: Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. In addition, participants must localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories.…
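To make the task format concrete, the following is a minimal sketch of what one prediction record might look like: four scores plus a list of anomaly timestamps, tagged with one of the three tracks. Everything here is an assumption for illustration, not the challenge's actual submission format: the class and field names (`PhyScorePrediction`, `Track`, `anomaly_timestamps`), the use of seconds as the timestamp unit, and the duration check are all hypothetical, and the score range is deliberately left unspecified because the abstract does not state it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Track(Enum):
    # Track names follow the abstract; the string values are illustrative.
    TEXT_TO_2D = "text-to-2D"
    IMAGE_TO_4D = "image-to-4D"
    VIDEO_TO_4D = "video-to-4D"


@dataclass
class PhyScorePrediction:
    """Hypothetical container for one video's predicted outputs."""

    video_id: str
    track: Track
    # The four dimensions named in the abstract (score scale not specified).
    video_quality: float
    physical_realism: float
    condition_video_alignment: float
    temporal_consistency: float
    # Predicted physical anomaly timestamps; seconds is an assumed unit.
    anomaly_timestamps: List[float] = field(default_factory=list)

    def scores(self) -> Dict[str, float]:
        """Return the four dimension scores keyed by name."""
        return {
            "video_quality": self.video_quality,
            "physical_realism": self.physical_realism,
            "condition_video_alignment": self.condition_video_alignment,
            "temporal_consistency": self.temporal_consistency,
        }

    def validate(self, duration_s: float) -> None:
        """Raise ValueError if any timestamp lies outside the video."""
        for t in self.anomaly_timestamps:
            if not 0.0 <= t <= duration_s:
                raise ValueError(f"timestamp {t} outside [0, {duration_s}]")


# Example with made-up values: one image-to-4D video with two anomalies.
pred = PhyScorePrediction(
    video_id="example_0001",
    track=Track.IMAGE_TO_4D,
    video_quality=3.0,
    physical_realism=2.5,
    condition_video_alignment=4.0,
    temporal_consistency=3.5,
    anomaly_timestamps=[1.2, 4.8],
)
pred.validate(duration_s=5.0)
```

Keeping the scores and the timestamps in one record mirrors the challenge's pairing of score prediction with anomaly localization; how the two parts are actually scored is not described in the text above.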
