DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Kairun Wen; Yuzhi Huang; Runyu Chen; Hui Zheng; Yunlong Lin; Panwang Pan; Chenxin Li; Wenyan Cong; Jian Zhang; Junbin Lu; Chenguo Lin; Dilin Wang; Zhicheng Yan; Hongyu Xu; Justin Theiss; Yue Huang; Xinghao Ding; Rakesh Ranjan; Zhiwen Fan

arXiv:2512.03000 · cs.CV · December 4, 2025

TL;DR

DynamicVerse introduces a comprehensive 4D multimodal framework that interprets and models real-world dynamic scenes from monocular videos, enabling more accurate and detailed understanding of physical environments.

Contribution

It presents a large-scale dataset and a novel integration of vision and geometric models for 4D world modeling from internet videos, surpassing existing methods in accuracy.

Findings

01. Superior performance in depth estimation
02. Enhanced camera pose accuracy
03. More precise physical-scale measurements

Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion,…

Videos

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling · slideslive

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications