TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

Yabo Chen; Yuanzhi Liang; Jiepeng Wang; Tingxi Chen; Junfei Cheng; Zixiao Gu; Yuyang Huang; Zicheng Jiang; Wei Li; Tian Li; Weichen Li; Zuoxin Li; Guangce Liu; Jialun Liu; Junqi Liu; Haoyuan Wang; Qizhen Weng; Xuan'er Wu; Xunzhi Xiang; Xiaoyan Yang; Xin Zhang; Shiwen Zhang; Junyu Zhou; Chengcheng Zhou; Haibin Huang; Chi Zhang; Xuelong Li

arXiv:2601.00051 · cs.CV · January 5, 2026

TL;DR

TeleWorld introduces a real-time 4D multimodal world model that unifies video generation, scene reconstruction, and memory, enabling coherent, long-term dynamic scene synthesis with practical computational efficiency.

Contribution

TeleWorld presents a novel closed-loop framework that combines generation, reconstruction, and guidance, using hierarchical planning and distillation techniques for real-time, long-horizon multimodal scene modeling.

Findings

01. Achieves high-quality, consistent dynamic scene generation in real time.

02. Effectively integrates static and dynamic scene components within a unified 4D model.

03. Demonstrates strong performance in long-term scene understanding and multimodal synthesis.

Abstract

World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical…
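
The generation-reconstruction-guidance loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every name here (`WorldMemory`, `generate_chunk`, `reconstruct`, `guide`, `run_closed_loop`) is hypothetical, frames are stand-in floats rather than images, and the "4D representation" is reduced to a simple list. It is meant only to show the control flow in which each generated chunk is folded into a persistent memory that then conditions the next chunk.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class WorldMemory:
    """Hypothetical stand-in for a persistent 4D (space + time) scene representation."""
    frames_seen: int = 0
    summary: List[float] = field(default_factory=list)


def generate_chunk(guidance: Optional[List[float]], chunk_len: int) -> List[float]:
    """Toy generator: each 'frame' is one float; guidance anchors where the chunk starts."""
    start = guidance[-1] if guidance else 0.0
    return [start + 0.1 * (i + 1) for i in range(chunk_len)]


def reconstruct(memory: WorldMemory, chunk: List[float]) -> WorldMemory:
    """Toy reconstruction: fold the newly generated frames into the memory."""
    memory.frames_seen += len(chunk)
    memory.summary.extend(chunk)
    return memory


def guide(memory: WorldMemory, window: int = 4) -> Optional[List[float]]:
    """Toy guidance: expose the most recent slice of memory as conditioning."""
    return memory.summary[-window:] if memory.summary else None


def run_closed_loop(num_chunks: int, chunk_len: int) -> Tuple[List[float], WorldMemory]:
    """Generate -> reconstruct -> guide, repeated for num_chunks chunks."""
    memory = WorldMemory()
    video: List[float] = []
    guidance: Optional[List[float]] = None
    for _ in range(num_chunks):
        chunk = generate_chunk(guidance, chunk_len)  # generation
        memory = reconstruct(memory, chunk)          # reconstruction
        guidance = guide(memory)                     # guidance for the next chunk
        video.extend(chunk)
    return video, memory


if __name__ == "__main__":
    video, memory = run_closed_loop(num_chunks=3, chunk_len=4)
    print(len(video), memory.frames_seen)
```

In this toy version, continuity across chunk boundaries comes entirely from the guidance signal: without it, every chunk would restart from the same initial value.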

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation