Accurate and Efficient World Modeling with Masked Latent Transformers
Maxime Burchi, Radu Timofte

TL;DR
This paper introduces EMERALD, a world model built on masked latent transformers that achieves state-of-the-art results on the Crafter benchmark, surpassing human expert performance within 10 million environment steps.
Contribution
EMERALD combines masked transformer predictions with latent space modeling to improve accuracy and efficiency over prior pixel-based methods.
Findings
EMERALD surpasses human expert performance on the Crafter benchmark.
It unlocks all 22 Crafter achievements during evaluation.
It achieves state-of-the-art results with fewer environment steps.
Abstract
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as Δ-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate…
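The abstract is cut off before it describes how the MaskGIT predictions are made, so the following is only a minimal sketch of generic MaskGIT-style iterative decoding, not EMERALD's actual implementation. It assumes a flat grid of discrete latent tokens, a cosine unmasking schedule, and a hypothetical `toy_predictor` that stands in for the transformer; all names, sizes, and the schedule are illustrative assumptions.

```python
import math
import random

MASK = -1  # placeholder id for a latent token that has not been predicted yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer.

    Returns one (token_id, confidence) pair per position. A real model
    would condition on the already revealed tokens; this one is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(num_tokens, vocab_size, steps=4, seed=0):
    """Fill a fully masked token grid over a few parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule (assumed): how many tokens stay masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        # Always reveal at least one token so the loop makes progress.
        keep_masked = min(keep_masked, len(masked) - 1)
        preds = toy_predictor(tokens, vocab_size, rng)
        # Commit the most confident predictions among the masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(maskgit_decode(num_tokens=16, vocab_size=8))
```

The point of the sketch is that many tokens are committed per step, so a full latent state is produced in a handful of forward passes rather than one pass per token.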
Taxonomy
Topics: Time Series Analysis and Forecasting
