Simple, Good, Fast: Self-Supervised World Models Free of Baggage
Jan Robine, Marc Höftmann, Stefan Harmeling

TL;DR
This paper introduces SGF, a simple and efficient self-supervised world model that avoids complex architectures. Ablation studies evaluate its building blocks, and quantitative comparisons show good performance on the Atari 100k benchmark.
Contribution
The paper presents SGF, a self-supervised world model that dispenses with RNNs, transformers, discrete representations, and image reconstructions, focusing instead on simplicity, robustness to model errors, and competitive performance.
Findings
SGF achieves competitive results on the Atari 100k benchmark.
Ablation studies highlight the importance of data augmentation and of frame and action stacking.
SGF outperforms some existing models in efficiency and robustness.
Abstract
What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.
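The abstract's "frame and action stacking" can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `stack_history`, the window size `k`, and the choice to pad the start of an episode by repeating the first frame and first action are all assumptions made for illustration.

```python
import numpy as np


def stack_history(frames, actions, k=4):
    """Stack the k most recent frames and actions for every time step.

    Hypothetical helper, not taken from the paper.
    frames:  array of shape (T, H, W, C)
    actions: array of shape (T,)
    Returns (stacked_frames, stacked_actions) with shapes
    (T, H, W, C * k) and (T, k); the most recent entry comes last.
    """
    T = frames.shape[0]
    # Assumption: pad the episode start by repeating the first frame/action.
    pad_frames = np.repeat(frames[:1], k - 1, axis=0)
    pad_actions = np.repeat(actions[:1], k - 1, axis=0)
    f = np.concatenate([pad_frames, frames], axis=0)    # (T + k - 1, H, W, C)
    a = np.concatenate([pad_actions, actions], axis=0)  # (T + k - 1,)
    # Window i contributes the frame/action that is (k - 1 - i) steps old.
    stacked_frames = np.concatenate([f[i:i + T] for i in range(k)], axis=-1)
    stacked_actions = np.stack([a[i:i + T] for i in range(k)], axis=-1)
    return stacked_frames, stacked_actions
```

For an episode of 6 grayscale frames of size 8×8 and `k=4`, the stacked frames have shape `(6, 8, 8, 4)` and the stacked actions have shape `(6, 4)`. Feeding such a window to a feed-forward model is one way to capture short-time dependencies without a recurrent or transformer backbone.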
Peer Reviews
Decision·ICLR 2025 Poster
## Presentation
This paper is well written, with a thorough discussion of related work, the design philosophy, and the precise formulation of the proposed model. The discussions are usually precise and insightful. The important elements in building the proposed world-models, such as the POMDP formulation, the representation learning (including sufficient details, such as image augmentations, temporal consistency, covariance regularization), the dynamics learning (the conditional independenc
A few limitations seem to restrict the contributions of this work. - As discussed thoroughly in the related work section of the paper, using world models to train reinforcement learning agents is not a new idea. In such cases, it is useful to establish that this work addresses a significant weakness in prior work without sacrificing other important metrics. In this case, the main motivation of SGF seems to be presenting a simple, fast, yet accurate method to train a good world mod
1. I believe these kinds of works are important. It is easy to just incrementally propose new components to improve the performance of systems while considerably increasing the engineering complexity. This does not give a clear view of the actual importance of the components included in the SOTA of world modeling. Going in a completely different direction is in my opinion a needed move sometimes, and it will help to shape new design choices for world modeling. Hence, the motivation is strong, an
1. I believe that while the proposed method focuses on short-term dependencies, as correctly stated in the limitation, how much performance degrades with an increasingly long-term dependency on actions would be important to quantify. This will allow us to assess the limitations of the proposed method in a more robust manner, for people to build upon. 2. It is not clear to me why only VICReg is chosen for representation extraction. There are relationships with BYOL and SimSiam as reported in the
In this work, the authors try to identify the essential components that give the best trade-off between accuracy and training time. 1. The presented SGF approach lies on the Pareto optimality curve in the chart of accuracy (normalized mean score) versus training time (hours) in Figure 5, where the other points on the Pareto optimality curve are SPR, DreamerV3, and EfficientZero (which performs lookahead). 2. The optimal combination of improvements (frame stacking, action stacking, temporal consistency, augmen
1. While optimal sizes of models and training times have been found in Table 6 to achieve the highest possible mean scores, this may mean that either the scalability of the approach is limited, or the approach must scale in many directions simultaneously to achieve even higher mean scores. Although the approach lies on the Pareto optimality curve on the Accuracy vs Training time chart, i.e. it is one of many optimal options, it is not shown how this approach can be scaled or improved to achieve
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
