iVideoGPT: Interactive VideoGPTs are Scalable World Models
Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long

TL;DR
iVideoGPT introduces a scalable transformer-based framework that integrates multimodal signals for interactive video prediction and decision-making, enabling versatile applications in reinforcement learning and planning.
Contribution
The paper presents a novel compressive tokenization method and a scalable autoregressive transformer architecture for interactive world models trained on extensive manipulation data.
Findings
- Achieves competitive performance in video prediction and planning tasks
- Successfully pre-trained on millions of manipulation trajectories
- Bridges generative video models with practical reinforcement learning applications
Abstract
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is…
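To make the "sequence of tokens" idea in the abstract concrete, the following is a minimal sketch of how per-step observations, actions, and rewards could be flattened into one token stream and turned into next-token prediction pairs. It is an illustration only and not the authors' implementation: the vocabulary sizes, the binning of actions and rewards into discrete tokens, the separator token, and all function names are assumptions made for this example. The visual token ids are assumed to come from some tokenizer that is not shown here.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary layout, assumed purely for illustration.
VISUAL_VOCAB = 8192                          # ids [0, 8192): visual tokens
ACTION_BINS = 256                            # ids [8192, 8448): binned action values
REWARD_BINS = 32                             # ids [8448, 8480): binned reward values
ACTION_OFFSET = VISUAL_VOCAB
REWARD_OFFSET = ACTION_OFFSET + ACTION_BINS
SEP = REWARD_OFFSET + REWARD_BINS            # step-boundary token (id 8480)


def bin_scalar(x: float, lo: float, hi: float, bins: int) -> int:
    """Clip x to [lo, hi] and map it to an integer bin in [0, bins)."""
    x = min(max(x, lo), hi)
    idx = int((x - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)


def build_sequence(
    frames: Sequence[Sequence[int]],
    actions: Sequence[Sequence[float]],
    rewards: Sequence[float],
) -> List[int]:
    """Interleave each step as [SEP, visual tokens, action tokens, reward token]."""
    tokens: List[int] = []
    for obs, act, rew in zip(frames, actions, rewards):
        tokens.append(SEP)
        tokens.extend(obs)
        # Assumption: action components lie in [-1, 1], rewards in [0, 1].
        tokens.extend(ACTION_OFFSET + bin_scalar(a, -1.0, 1.0, ACTION_BINS) for a in act)
        tokens.append(REWARD_OFFSET + bin_scalar(rew, 0.0, 1.0, REWARD_BINS))
    return tokens


def next_token_pairs(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position: the model sees tokens[:-1] and predicts tokens[1:]."""
    return list(tokens[:-1]), list(tokens[1:])


# Toy example: two steps, three visual tokens per frame, 2-D actions.
seq = build_sequence(
    frames=[[1, 2, 3], [4, 5, 6]],
    actions=[[0.0, 1.0], [-1.0, 0.5]],
    rewards=[0.0, 1.0],
)
inputs, targets = next_token_pairs(seq)
```

Under this layout an agent would interact with the model step by step: it appends its own action tokens to the context and lets the model generate the following reward and observation tokens. How iVideoGPT actually encodes actions and rewards may differ from this binning scheme.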
Code & Models
- 🤗 thuml/ivideogpt-oxe-64-act-free (model)
- 🤗 thuml/ivideogpt-oxe-64-act-free-medium (model)
- 🤗 thuml/ivideogpt-oxe-64-goal-cond (model)
- 🤗 thuml/ivideogpt-bair-64-act-cond (model)
- 🤗 thuml/ivideogpt-bair-64-act-free (model)
- 🤗 thuml/ivideogpt-robonet-64-act-cond (model)
- 🤗 thuml/ivideogpt-vp2-robodesk-64-act-cond (model)
- 🤗 thuml/ivideogpt-vp2-robosuite-64-act-cond (model)
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
