iVideoGPT: Interactive VideoGPTs are Scalable World Models

Jialong Wu; Shaofeng Yin; Ningya Feng; Xu He; Dong Li; Jianye Hao; Mingsheng Long

arXiv:2405.15223 · cs.CV · November 1, 2024

TL;DR

iVideoGPT introduces a scalable transformer-based framework that integrates multimodal signals for interactive video prediction and decision-making, enabling versatile applications in reinforcement learning and planning.

Contribution

The paper presents a novel compressive tokenization method and a scalable autoregressive transformer architecture for interactive world models trained on extensive manipulation data.

Findings

01. Achieves competitive performance in video prediction and planning tasks
02. Successfully pre-trained on millions of manipulation trajectories
03. Bridges generative video models with practical reinforcement learning applications

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is…
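The idea of folding observations, actions, and rewards into one token stream can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the vocabulary sizes, the offset layout, the separator token, the uniform reward binning, and the helper names `build_sequence` and `next_token_pairs` are hypothetical, and the paper's actual tokenization and conditioning scheme may differ.

```python
from typing import List, Sequence, Tuple

# All constants are illustrative assumptions, not values from the paper.
VISUAL_VOCAB = 1024     # hypothetical size of the visual token codebook
NUM_ACTIONS = 8         # hypothetical number of discrete actions
NUM_REWARD_BINS = 4     # hypothetical number of reward bins

# Each modality gets its own id range so the tokens cannot collide.
ACTION_OFFSET = VISUAL_VOCAB
REWARD_OFFSET = ACTION_OFFSET + NUM_ACTIONS
STEP_SEP = REWARD_OFFSET + NUM_REWARD_BINS  # marks the start of each step


def reward_to_bin(reward: float, low: float = 0.0, high: float = 1.0) -> int:
    """Map a scalar reward to one of NUM_REWARD_BINS uniform bins."""
    clipped = min(max(reward, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * NUM_REWARD_BINS), NUM_REWARD_BINS - 1)


def build_sequence(
    obs_tokens: Sequence[Sequence[int]],
    actions: Sequence[int],
    rewards: Sequence[float],
) -> List[int]:
    """Interleave per-step observation, action, and reward tokens."""
    if not (len(obs_tokens) == len(actions) == len(rewards)):
        raise ValueError("obs_tokens, actions, and rewards must align per step")
    seq: List[int] = []
    for frame, action, reward in zip(obs_tokens, actions, rewards):
        seq.append(STEP_SEP)
        seq.extend(frame)                                  # visual tokens
        seq.append(ACTION_OFFSET + action)                 # action token
        seq.append(REWARD_OFFSET + reward_to_bin(reward))  # reward token
    return seq


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair each token with its successor, the target in next-token prediction."""
    return list(zip(seq[:-1], seq[1:]))


if __name__ == "__main__":
    tokens = build_sequence([[3, 7], [5, 9]], actions=[1, 0], rewards=[0.0, 1.0])
    print(tokens)
    print(next_token_pairs(tokens)[:3])
```

Under these assumptions, an autoregressive model trained on such pairs would predict the next observation and reward tokens after being given an action token, which is one way to read the "interactive experience via next-token prediction" described in the abstract.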

Code & Models

Repositories

thuml/iVideoGPT
PyTorch · Official

Videos

iVideoGPT: Interactive VideoGPTs are Scalable World Models · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques