EvIL: Evolution Strategies for Generalisable Imitation Learning

Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster

arXiv:2406.11905·cs.NE·June 19, 2024

TL;DR

This paper introduces EvIL, an evolution strategies-based method that improves reward shaping and transfer in imitation learning, enabling more efficient policy re-training across different environments.

Contribution

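The paper proposes EvIL, an evolution-strategies approach that enhances reward shaping and transfer in imitation learning. It targets two weaknesses of existing deep IL algorithms: recovered rewards that induce policies far weaker than the expert, and poorly shaped rewards that require extensive environment interaction to optimise.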

Findings

01

Reward model ensembles improve transfer performance.

02

EvIL accelerates policy re-training in new environments.

03

Method outperforms prior work in sample efficiency.

Abstract

Oftentimes in imitation learning (IL), the environment we collect expert demonstrations in and the environment we want to deploy our learned policy in aren't exactly the same (e.g. demonstrations collected in simulation but deployment in the real world). Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments. This transfer is usually performed by optimising the recovered reward under the dynamics of the target environment. However, (a) we find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in. Furthermore, (b) these rewards are often quite poorly shaped, necessitating extensive environment interaction to optimise…

Peer Reviews

Decision·ICML 2024 Poster

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

- The ability to predict an expert’s behavior in an environment different from the one where the trajectory data was collected is an important topic.
- Able to handle non-differentiable rewards.

Weaknesses

- Since the paper relies only on experiments to demonstrate the method's effectiveness, I believe that the experiments are too few and too simple. Usually, when someone talks about the generalization of RL with respect to the transition function, I think of more challenging and practical modifications such as different gravity, mass, and friction.
- The method is very slow since it involves multiple rounds of training.
- The authors do not clarify why we should use their design since the method is not t

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

It is certain that evolution can provide a solution to any problem. Originality of the proposal includes:
• estimating the gradients of the policy and transition model parameters with Gaussian mutation and performing gradient descent, rather than the usual selection
• tuning the number of inner-loop steps to avoid the disappearance of the above gradient
• L1 distillation of the estimated reward functions
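The Gaussian-mutation gradient estimate mentioned in this review can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic antithetic evolution-strategies estimator applied to a hypothetical toy objective (`toy_loss`), and the names `es_gradient`, `es_minimise`, `sigma`, `n_pairs`, and `lr` are assumptions made for illustration. The point is only that the update needs function evaluations, not derivatives of the objective.

```python
import random

def es_gradient(f, theta, sigma=0.1, n_pairs=50, rng=None):
    """Antithetic ES estimate of the gradient of E[f(theta + sigma * eps)].

    Only evaluations of f are used, so f need not be differentiable.
    """
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(n_pairs):
        # Gaussian mutation of the parameters, evaluated in both directions.
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        f_plus = f([t + sigma * e for t, e in zip(theta, eps)])
        f_minus = f([t - sigma * e for t, e in zip(theta, eps)])
        scale = (f_plus - f_minus) / (2.0 * sigma * n_pairs)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad

def toy_loss(theta):
    # Hypothetical stand-in objective with its minimum at theta = (1, ..., 1).
    return sum((t - 1.0) ** 2 for t in theta)

def es_minimise(f, theta, lr=0.1, steps=200, sigma=0.1, n_pairs=50, seed=0):
    """Plain gradient descent driven by the ES gradient estimate."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = es_gradient(f, theta, sigma=sigma, n_pairs=n_pairs, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

theta_star = es_minimise(toy_loss, [0.0, 0.0, 0.0])
```

Here `n_pairs` plays the role of a population size (each pair costs two evaluations of the objective), which is the kind of parameter the reviewers ask to see reported.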

Weaknesses

A crucial issue for an evolutionary approach is how practically and competitively it can solve a real-world problem. Figure 1 presents the comparison with AIRL, but the implementation of AIRL is not sufficiently documented. The x-axis is the number of outer loops, but how many interactions with the environment happened in each outer loop for EvIL and AIRL? Basic parameters such as the population size N should be reported in the main text.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The paper presents an interesting application of ES, taking advantage of the fact that neither the reward nor the training procedure needs to be differentiable.
- The approach addresses some key limitations of most IL approaches.
- Baseline comparisons and ablation tests are sensible.
- Promising results in three benchmarks.

Weaknesses

- Some acronyms are not defined, e.g. AIRL (adversarial inverse reinforcement learning). What is its training setup?
- The approach should be evaluated on more environments, and potentially more complex ones. Is the approach generally better than BC, or does it depend on the type of environment?
- It would be useful to test how transferable the learned reward functions are to similar domains.
- Hyperparameters for some of the experiment steps don’t seem to be mentioned (e.g. learning rates, etc.?)
- The e

Code & Models

Repositories

SilviaSapora/evil (JAX, official)

Taxonomy

Topics: Human Pose and Action Recognition · Reinforcement Learning in Robotics · Robot Manipulation and Learning