Reward-Free Curricula for Training Robust World Models
Marc Rigter, Minqi Jiang, Ingmar Posner

TL;DR
This paper introduces WAKER, a method for generating curricula in reward-free exploration to train robust world models that generalize across diverse environments, improving adaptability and performance.
Contribution
The paper proposes a novel approach to create curricula for reward-free training, linking minimax regret to model error, and introduces the WAKER algorithm for robust world model learning.
Findings
WAKER outperforms baselines in robustness and generalization.
WAKER improves training efficiency in diverse environments.
The paper establishes a theoretical connection between minimax regret and model error.
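For readers unfamiliar with the term, the standard notion of minimax regret can be written as follows. The notation here is assumed for illustration and is not taken from the paper: $\theta \in \Theta$ indexes an environment instance, $R$ is a reward function, $V(\pi, \theta, R)$ is the expected return of policy $\pi$, and $\pi^*_{\theta,R}$ is an optimal policy for that instance and reward.

```latex
\[
\mathrm{Regret}(\pi, \theta, R) = V(\pi^*_{\theta,R}, \theta, R) - V(\pi, \theta, R),
\qquad
\pi_{\mathrm{minimax}} \in \arg\min_{\pi} \max_{\theta \in \Theta} \mathrm{Regret}(\pi, \theta, R).
\]
```

A policy with low minimax regret is close to optimal on every environment instance, not merely on average, which is the sense of robustness described in the abstract.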
Abstract
There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection…
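The abstract is cut off before it describes how environments are selected, so the following is only a minimal sketch of the general idea of error-weighted environment sampling, not the WAKER selection rule. It assumes a per-environment estimate of world-model error is already available. The function name `sample_environment`, the `temperature` parameter, and the use of a Boltzmann (softmax) distribution are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_environment(error_estimates, temperature=1.0, rng=None):
    """Pick an environment index, favouring those with higher estimated model error.

    Hypothetical helper for illustration; not the paper's algorithm.
    Returns the sampled index and the sampling distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    errors = np.asarray(error_estimates, dtype=float)

    # Softmax over error estimates; lower temperature concentrates
    # sampling on the environments with the largest error.
    logits = errors / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()

    index = int(rng.choice(len(errors), p=probs))
    return index, probs


# Example: three environments with assumed (made-up) error estimates.
index, probs = sample_environment([0.1, 0.5, 2.0], temperature=0.5)
```

Sampling from a distribution, instead of always choosing the environment with the largest error, keeps some data flowing from every environment, so the error estimates for rarely visited environments do not go stale.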
Peer Reviews
Decision: ICLR 2024 poster
- (Relevance) The paper tackles a very relevant problem, which is learning a robust model of the world from reward-free interactions with a class of environments;
- (Originality) The paper contributes some nice ideas, such as considering the model error in expectation under the worst-case policy.
- (Presentation) The first part of the paper seems to set a very ambitious goal, which is to tackle the general problem of learning a model of a class of environments that makes it possible to approximately solve any RL task within the class. I would rather present a narrower scope of extending domain randomisation to reward-free settings through a model-based approach;
- (Strong assumptions) The set of assumptions does not look completely reasonable. Especially assuming the knowledge of a decoder that map
- The paper is very well written, and provides sufficient background for readers unfamiliar with the problem to appreciate their theoretical result (Proposition 1). The theoretical treatment is not particularly surprising in itself, but helps motivate the proposed method which I appreciate. Assumptions are clearly stated (perfect model and latent policy exists), which seem pretty reasonable to make for the sake of analysis.
- The proposed method is intuitive and appears to be fairly straightforw
I have three concerns at the moment:
- **Contributions.** While I really do consider simplicity a strength, the proposed method is undeniably incremental. At surface level, the main algorithmic contribution is automatic curricula by sampling of environments for which the current model error is large, which -- again, at surface level -- has been explored in numerous prior works as a means of guiding e.g. exploration, curricula for which rewards are available, gradient updates, or as model regula
* **Presentation**: the paper is well-written and the presentation is clear and clean. The metrics plotted in the Results section are sensible and the empirical results well reflect the story of the work.
* **Method**: the idea behind this work is innovative and well-motivated. Given the absence of previous work in this specific setting, I think the method will be useful for future work. The results also corroborate the usefulness of the approach.
* **Breaking assumptions in implementation**: I think the approach is well-motivated theoretically. However, in practice, the authors adopted a world model implementation that learns the dynamics using a recurrent module. By exploiting memory, the model is not encouraged to learn a Markovian latent state space. Given this is one of the assumptions behind the work, I think the authors should have opted for a memoryless model (e.g. learning from sequences of data but without keeping a memory) or t
Taxonomy
Topics: Reinforcement Learning in Robotics · AI-based Problem Solving and Planning · Machine Learning and Algorithms
