WorldGPT: Empowering LLM as Multimodal World Model

Zhiqi Ge; Hongzhe Huang; Mingze Zhou; Juncheng Li; Guoming Wang; Siliang Tang; Yueting Zhuang

arXiv:2404.18202 · cs.AI · October 1, 2024

Open Access · 1 Repo

TL;DR

WorldGPT is a multimodal large language model that learns world dynamics from videos, integrates memory and knowledge mechanisms, and demonstrates strong capabilities in scenario modeling, prediction, and domain generalization.

Contribution

The paper introduces WorldGPT, a generalist multimodal world model trained on videos, together with a novel cognitive architecture and a new benchmark for evaluating world state transitions.

Findings

01. WorldGPT accurately models complex world dynamics.
02. It outperforms existing models in predicting state transitions.
03. It can generate reliable synthetic data for fine-tuning multimodal agents.

Abstract

World models are increasingly employed across diverse fields, from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions and are confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we integrate it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Conducting evaluations on…

Code & Models

Repositories

dcdmllm/worldgpt (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems