Discovering Minimal Reinforcement Learning Environments

Jarek Liesen; Chris Lu; Andrei Lupu; Jakob N. Foerster; Henning Sprekeler; Robert T. Lange

arXiv:2406.12589·cs.LG·June 19, 2024

TL;DR

This paper introduces a method to discover minimal, transferable reinforcement learning environments using meta-learning and synthetic bandits, aiming to improve training efficiency and transferability across algorithms.

Contribution

It extends meta-learning for environment discovery to be invariant to hyperparameters and algorithms, and demonstrates the effectiveness of contextual bandits for environment transferability.

Findings

1. Meta-learning environments invariant to hyperparameters
2. Contextual bandits enable transfer to complex MDPs
3. Synthetic environments can speed up downstream RL tasks
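
As a rough illustration of the second finding, a synthetic contextual bandit can be pictured as an environment with no transition function: it samples a context, scores a single action, and ends the episode. The sketch below is a minimal plain-Python version under stated assumptions; the class name, the linear-plus-tanh reward, and the Gaussian context distribution are hypothetical stand-ins, not the paper's architecture or the synthetic-gymnax API.

```python
import math
import random


class SyntheticContextualBandit:
    """One-step environment: sample a context, score one action, terminate."""

    def __init__(self, context_dim, num_actions, seed=0):
        self.rng = random.Random(seed)
        self.context_dim = context_dim
        self.num_actions = num_actions
        # Hypothetical stand-in for a meta-learned reward network:
        # one weight vector per action, initialised at random.
        self.weights = [
            [self.rng.gauss(0.0, 1.0) for _ in range(context_dim)]
            for _ in range(num_actions)
        ]

    def reset(self):
        # Assumed context distribution: standard normal per dimension.
        return [self.rng.gauss(0.0, 1.0) for _ in range(self.context_dim)]

    def reward(self, context, action):
        # Bounded synthetic reward for the chosen action in this context.
        score = sum(w * x for w, x in zip(self.weights[action], context))
        return math.tanh(score)

    def step(self, context, action):
        # No transition function: every episode ends after a single action.
        return self.reward(context, action), True


def greedy_action(bandit, context):
    # Pick the action with the highest synthetic reward for this context.
    return max(range(bandit.num_actions), key=lambda a: bandit.reward(context, a))


bandit = SyntheticContextualBandit(context_dim=4, num_actions=2)
context = bandit.reset()
action = greedy_action(bandit, context)
reward, done = bandit.step(context, action)
```

In the meta-learning setting the abstract describes, an outer loop would adjust the reward parameters (here `weights`) so that agents trained on these one-step episodes perform well in the target environment; that loop is omitted from this sketch.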

Abstract

Reinforcement learning (RL) agents are commonly trained and evaluated in the same environment. In contrast, humans often train in a specialized environment before being evaluated, such as studying a book before taking an exam. The potential of such specialized training environments is still vastly underexplored, despite their capacity to dramatically speed up training. The framework of synthetic environments takes a first step in this direction by meta-learning neural network-based Markov decision processes (MDPs). The initial approach was limited to toy problems and produced environments that did not transfer to unseen RL algorithms. We extend this approach in three ways: Firstly, we modify the meta-learning algorithm to discover environments invariant towards hyperparameter configurations and learning algorithms. Secondly, by leveraging hardware parallelism and introducing a…

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

- The research direction considered, discovering environments in which algorithms can learn quickly, seems novel and has the potential to be quite interesting for deployment of RL algorithms. I especially like that uses on further downstream tasks, beyond the prescribed inner loop, are considered. This can have several benefits, not only for training RL algorithms but also for developing new algorithms, with discovered environments allowing quicker iteration.

Weaknesses

- There seems to be a connection between the proposed method and simpler non-meta RL algorithms that just learn the value function, which is only mentioned in passing. The optimal reward of the contextual bandit seems to be the optimal value function at the context (state in the MDP). What is the benefit of meta-learning this, rather than just using Monte Carlo returns? Or, what is the benefit of this over just learning the value function itself?
- Important claims made at the beginning of the pape

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. Using a simpler model (synthetic contextual bandit, SCB) as a proxy is an interesting idea when the environment is complex.
2. Experiments on interpretability justify the choice of SCB as the proxy.

Weaknesses

One of the main contributions of this paper is the use of a synthetic contextual bandit (SCB) as a proxy to the real environment. However, it is important to note that:
* A contextual bandit (CB) can be converted to a Markov decision process (MDP), but not vice versa, because CB is stateless. This means that the SCB model may not be able to accurately capture the dynamics of more complex environments, such as Go, where state is essential for planning and decision-making.
* It is also unclear ho
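
The first point above, that a contextual bandit can be converted to an MDP, can be sketched as a wrapper whose episodes always terminate after one step. This is a minimal illustration only; `OneStepMDP`, `sample_context`, and `reward_fn` are hypothetical names and do not come from the paper or its repository.

```python
import random


class OneStepMDP:
    """Wraps a contextual bandit as an MDP whose episodes last one step."""

    def __init__(self, sample_context, reward_fn):
        self.sample_context = sample_context  # () -> context
        self.reward_fn = reward_fn            # (context, action) -> float
        self.state = None

    def reset(self):
        # The bandit's context plays the role of the initial MDP state.
        self.state = self.sample_context()
        return self.state

    def step(self, action):
        reward = self.reward_fn(self.state, action)
        # The next state is irrelevant because the episode terminates here.
        next_state, done = self.state, True
        return next_state, reward, done


rng = random.Random(0)
mdp = OneStepMDP(
    sample_context=lambda: rng.choice([0, 1]),
    reward_fn=lambda context, action: 1.0 if action == context else 0.0,
)
state = mdp.reset()
next_state, reward, done = mdp.step(action=state)
```

The reverse direction has no such wrapper: a multi-step MDP carries transition dynamics that a single-step episode cannot represent, which is the limitation the reviewer raises.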

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 2

Strengths

This paper presents extensive experimental results to demonstrate and explain the proposed method. First, the authors show the performance of meta-trained synthetic environments on multiple OpenAI Gym environments. Further, the authors show through an empirical study that training contextual bandits is sufficient to train RL agents, which significantly reduces the number of parameters needed. The idea of simplifying an RL problem to a CB by eliminating the transition probability makes the entire framew

Weaknesses

I found this paper very hard to follow. Some necessary background knowledge is not introduced in the paper. Although this paper gives empirical evidence for its statements, the intuition behind the ideas is not explained. Without any explanation, I am not convinced by the claimed results. For example, the authors claim that the state transition probability is not necessary to train well-performing agents. I read the experimental results, but I cannot understand the intuition behind them.

Code & Models

Repositories

keraJLi/synthetic-gymnax · JAX · Official

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings