WorldGPT: Empowering LLM as Multimodal World Model
Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang

TL;DR
WorldGPT is a multimodal large language model that learns world dynamics from videos, integrates memory and knowledge mechanisms, and demonstrates strong capabilities in scenario modeling, prediction, and domain generalization.
Contribution
It introduces WorldGPT, a generalist multimodal world model trained on videos, with a novel cognitive architecture and a new benchmark for evaluating world state transitions.
Findings
WorldGPT accurately models complex world dynamics.
It outperforms existing models in predicting state transitions.
It can generate reliable synthetic data for fine-tuning multimodal agents.
Abstract
World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Conducting evaluations on…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
