Discovering Minimal Reinforcement Learning Environments
Jarek Liesen, Chris Lu, Andrei Lupu, Jakob N. Foerster, Henning Sprekeler, Robert T. Lange

TL;DR
This paper introduces a method to discover minimal, transferable reinforcement learning environments using meta-learning and synthetic bandits, aiming to improve training efficiency and transferability across algorithms.
Contribution
It extends meta-learning for environment discovery to be invariant to hyperparameters and algorithms, and demonstrates the effectiveness of contextual bandits for environment transferability.
Findings
Meta-learning environments invariant to hyperparameters
Contextual bandits enable transfer to complex MDPs
Synthetic environments can speed up downstream RL tasks
Abstract
Reinforcement learning (RL) agents are commonly trained and evaluated in the same environment. In contrast, humans often train in a specialized environment before being evaluated, such as studying a book before taking an exam. The potential of such specialized training environments is still vastly underexplored, despite their capacity to dramatically speed up training. The framework of synthetic environments takes a first step in this direction by meta-learning neural network-based Markov decision processes (MDPs). The initial approach was limited to toy problems and produced environments that did not transfer to unseen RL algorithms. We extend this approach in three ways: Firstly, we modify the meta-learning algorithm to discover environments invariant towards hyperparameter configurations and learning algorithms. Secondly, by leveraging hardware parallelism and introducing a…
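As a rough illustration of the contextual-bandit idea mentioned in the summary above, the sketch below shows a one-step synthetic environment whose reward comes from a small neural network. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `SyntheticContextualBandit`, the network sizes, the `tanh` activation, and the standard-normal context distribution are all hypothetical choices made for illustration. In a meta-learning setup, an outer loop would adjust `params` so that agents trained here perform well in the real environment; that outer loop is not shown.

```python
import numpy as np


class SyntheticContextualBandit:
    """Toy one-step environment: reward = MLP(context)[action].

    Hypothetical illustration; names and sizes are not from the paper.
    """

    def __init__(self, obs_dim, num_actions, hidden=16, seed=0):
        self.rng = np.random.default_rng(seed)
        self.obs_dim = obs_dim
        self.num_actions = num_actions
        # Parameters that an outer (meta-learning) loop would optimise.
        self.params = {
            "w1": self.rng.normal(0.0, 0.1, (obs_dim, hidden)),
            "b1": np.zeros(hidden),
            "w2": self.rng.normal(0.0, 0.1, (hidden, num_actions)),
            "b2": np.zeros(num_actions),
        }
        self.context = None

    def rewards(self, context):
        """Return the synthetic reward of every action for one context."""
        h = np.tanh(context @ self.params["w1"] + self.params["b1"])
        return h @ self.params["w2"] + self.params["b2"]

    def reset(self):
        # Assumption: contexts are drawn from a standard normal here.
        self.context = self.rng.normal(size=self.obs_dim)
        return self.context

    def step(self, action):
        reward = float(self.rewards(self.context)[action])
        done = True  # bandit: the episode ends after a single action
        return self.context, reward, done


env = SyntheticContextualBandit(obs_dim=4, num_actions=2)
obs = env.reset()
_, reward, done = env.step(action=1)
# With one-step episodes, the greedy action is the argmax of the rewards.
best_action = int(np.argmax(env.rewards(obs)))
```

Because every episode ends after one action, there is no transition function to learn, which is one way to read the claim that a bandit needs fewer parameters than a full synthetic MDP.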
Peer Reviews
Decision·Submitted to ICLR 2024
- The research direction considered, discovering environments in which algorithms can learn quickly, seems novel and has the potential to be quite interesting for the deployment of RL algorithms. I especially like that use on further downstream tasks, beyond the prescribed inner loop, is considered. This can have several benefits, not only for training RL algorithms but also for developing new algorithms, since discovered environments allow quicker iteration.
- There seems to be a connection between the proposed method and simpler non-meta RL algorithms that just learn the value function, which is only mentioned in passing. The optimal reward of the contextual bandit seems to be the optimal value function at the context (the state in the MDP). What is the benefit of meta-learning this, rather than just using Monte Carlo returns? Or, what is the benefit of this over just learning the value function itself?
- Important claims made at the beginning of the pape
1. Using a simpler model (synthetic contextual bandit, SCB) as a proxy is an interesting idea when the environment is complex.
2. Experiments on interpretability justify the choice of SCB as the proxy.
One of the main contributions of this paper is the use of a synthetic contextual bandit (SCB) as a proxy to the real environment. However, it is important to note that:
* A contextual bandit (CB) can be converted to a Markov decision process (MDP), but not vice versa, because CB is stateless. This means that the SCB model may not be able to accurately capture the dynamics of more complex environments, such as Go, where state is essential for planning and decision-making.
* It is also unclear ho
This paper presents extensive experimental results to demonstrate and explain the proposed method. Firstly, the authors show the performance of meta-trained synthetic environments on multiple OpenAI Gym environments. Further, the authors show through an empirical study that training in contextual bandits is sufficient to train RL agents, which significantly reduces the number of parameters needed. The idea of simplifying an RL problem to a CB by eliminating the transition probability makes the entire framew
I found this paper very hard to follow. Some necessary background knowledge is not introduced in the paper. Although this paper gives empirical evidence for its statements, the intuition behind the ideas is not explained. Without any explanation, I am not convinced by the claimed results. For example, the authors claim that the state transition probability is not necessary to train well-performing agents. I read the experimental results, but I cannot understand the intuition behind them.
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
