Masked Generative Priors Improve World Models Sequence Modelling Capabilities
Cristian Meo, Mircea Lica, Zarif Ikram, Akihiro Nakano, Vedant Shah, Aniket Rajiv Didolkar, Dianbo Liu, Anirudh Goyal, Justin Dauwels

TL;DR
This paper introduces GIT-STORM, a world model that uses masked generative priors to improve sequence modeling in reinforcement learning and video prediction, with reported gains on the Atari 100k benchmark.
Contribution
It replaces traditional priors with masked generative priors in world models, enabling better sequence modeling and extending Transformer-based world models to continuous control environments.
Findings
GIT-STORM outperforms previous models on the Atari 100k benchmark.
Transformer-based world models are effective for continuous control tasks.
Masked generative priors enhance sequence modeling capabilities.
Abstract
Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and…
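The abstract is cut off above, so the paper's exact prior is not shown here. As a rough illustration of how a MaskGIT-style prior generates a set of categorical latent tokens in general, the sketch below starts from fully masked tokens and fills them in over a few steps, keeping the most confident predictions first under a cosine schedule. It is not the authors' implementation: `MASK`, `toy_logits`, `cosine_mask_count`, `maskgit_decode`, and all parameter values are hypothetical names and choices made for this example, and `toy_logits` stands in for a learned bidirectional network.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a token that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_mask_count(step, num_steps, num_tokens):
    """Tokens left masked after `step` (1-indexed) of `num_steps` (cosine schedule)."""
    ratio = math.cos(math.pi / 2 * step / num_steps)
    return int(math.floor(ratio * num_tokens))


def toy_logits(tokens, num_classes=4):
    """Hypothetical stand-in for a learned network: favours class i % num_classes."""
    return [[5.0 if c == i % num_classes else 0.0 for c in range(num_classes)]
            for i in range(len(tokens))]


def maskgit_decode(logits_fn, num_tokens, num_steps=4, rng=None):
    """Iteratively unmask tokens, committing the most confident samples first."""
    rng = rng or random.Random(0)
    tokens = [MASK] * num_tokens
    for step in range(1, num_steps + 1):
        all_logits = logits_fn(tokens)
        # Sample a candidate token (and its probability) for every masked slot.
        candidates = {}
        for i, tok in enumerate(tokens):
            if tok != MASK:
                continue
            probs = softmax(all_logits[i])
            choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            candidates[i] = (choice, probs[choice])
        # Decide how many slots stay masked, then commit the rest by confidence.
        keep_masked = cosine_mask_count(step, num_steps, num_tokens)
        num_to_fill = max(0, len(candidates) - keep_masked)
        ranked = sorted(candidates, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            tokens[i] = candidates[i][0]
    return tokens


if __name__ == "__main__":
    print(maskgit_decode(toy_logits, num_tokens=8, num_steps=4))
```

In a world model, the network behind `logits_fn` would presumably also be conditioned on the hidden state of the sequence model. That conditioning is left out here to keep the decoding loop itself visible.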
Peer Reviews
Decision · Submitted to ICLR 2025
The empirical evaluation spans discrete and continuous action benchmarks, providing a robust assessment of GIT-STORM’s performance. The reported results demonstrate that GIT-STORM not only improves sample efficiency in RL tasks but also enhances video prediction quality, particularly in the Atari 100k benchmark, aligning well with the study's objectives. Moreover, the paper is well-written with a clear structure, providing a good experience as a reader. Extending the transformer-based world mode
It remains unclear why GIT-STORM does not consistently outperform STORM across all benchmarks or why it fails to close the performance gap with DreamerV3 in environments beyond Atari 100k. The paper does not fully explain the conditions under which GIT-STORM’s improvements are more marginal, suggesting a need for clearer insights into the impact of individual architectural components. The paper claims state-of-the-art results for GIT-STORM on select environments, yet Table 6 seems to indicate t
- The motivation of incorporating a MaskGIT prior into the STORM architecture is clear.
- The proposed method is straightforward and easy to reproduce.
- MaskGIT can effectively improve the video prediction quality of STORM, indicating applicability of GIT-STORM to more complicated tasks.

- The paper contains a misstatement in its contributions. The authors claim that they "apply transformer-based world models to continuous action environments for the first time". This claim is inaccurate, as TransDreamer [1] can also be applied to continuous action environments. The authors are evidently aware of this paper, given that they have cited it in this work.
- The state-mixer design is not properly addressed. If the authors claim this part of their contribution, they should either elabo

1. This paper clearly distinguishes itself from previous work, with good comparison and illustration.
2. One-hot categorical latent is widely used in recent model-based RL, yet the research on it is insufficient. This paper provides a novel view of it.
3. This paper bridges the gap of the lack of evaluation of transformer-based world models on continuous control tasks.
1. The motivation and effect of using MaskGIT head in world models are unclear. Is there any evidence that the world models would have hallucinations, and how could a MaskGIT head mitigate such issues? How to distinguish if the improved performance (both RL and FVD) comes from more parameters or the MaskGIT prior? There should be some further investigation into the mechanism of the MaskGIT head. Such as: (a) What's the difference between the latent variables (or distributions) generate
Taxonomy
Topics: Simulation Techniques and Applications · Advanced Database Systems and Queries · Graph Theory and Algorithms
