Efficient Active Imitation Learning with Random Network Distillation

Emilien Biré; Anthony Kobanda; Ludovic Denoyer; Rémy Portelas

arXiv:2411.01894 · cs.LG · April 15, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces RND-DAgger, an active imitation learning method that efficiently reduces expert interventions by detecting out-of-distribution states, improving learning in complex video game and robotic tasks.

Contribution

The paper presents RND-DAgger, a novel active imitation learning approach that uses Random Network Distillation to trigger expert queries only when necessary, reducing reliance on constant expert input.
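
The query-triggering idea can be illustrated with a minimal sketch of Random Network Distillation used as an out-of-distribution score. It is an illustration under stated assumptions, not the authors' implementation. In RND, a predictor is trained to match the output of a fixed, randomly initialised target network on familiar states, so its prediction error tends to be low on similar states and high on unfamiliar ones. Everything specific here is an assumption: the network sizes, the synthetic "expert" states, the 99th-percentile threshold, and the helper names (`target`, `features`, `fit_predictor`, `ood_score`, `needs_expert`). For brevity and determinism, the predictor trains only a linear readout in closed form with ridge regression, a simplification swapped in for gradient-based training.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, OUT_DIM = 4, 64, 8  # illustrative sizes, not from the paper

# Fixed, randomly initialised target network (never trained).
W1_t = rng.normal(0.0, 0.5, (STATE_DIM, HIDDEN))
W2_t = rng.normal(0.0, 0.125, (HIDDEN, OUT_DIM))

def target(states):
    return np.tanh(states @ W1_t) @ W2_t

# Predictor: its own fixed random hidden layer plus a trainable linear readout.
W1_p = rng.normal(0.0, 0.5, (STATE_DIM, HIDDEN))

def features(states):
    hidden = np.tanh(states @ W1_p)
    return np.hstack([hidden, np.ones((len(states), 1))])  # append bias column

def fit_predictor(states, ridge=1e-4):
    # Ridge regression of the target outputs onto the predictor features.
    phi = features(states)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ target(states))

def ood_score(states, readout):
    # Per-state mean squared error between predictor and target outputs.
    err = features(states) @ readout - target(states)
    return np.mean(err ** 2, axis=1)

# Synthetic stand-in for states seen in expert demonstrations (assumption).
expert_states = rng.normal(0.0, 0.3, (512, STATE_DIM))
readout = fit_predictor(expert_states)

# Hypothetical rule: flag states above the 99th percentile of expert scores.
threshold = np.quantile(ood_score(expert_states, readout), 0.99)

def needs_expert(states):
    return ood_score(states, readout) > threshold

familiar = rng.normal(0.0, 0.3, (200, STATE_DIM))  # same distribution as the data
novel = rng.normal(4.0, 0.3, (200, STATE_DIM))     # shifted, unseen region
print("fraction flagged (familiar):", needs_expert(familiar).mean())
print("fraction flagged (novel):", needs_expert(novel).mean())
```

In an active imitation loop, a check like `needs_expert` would be evaluated on the states the learning agent visits, and control would pass to the expert only when it fires. How the threshold is chosen and how long the expert keeps control are design choices this sketch does not take from the paper.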

Findings

01. RND-DAgger reduces expert queries compared to traditional methods.

02. It outperforms existing active imitation learning approaches in video game tasks.

03. The method is effective in robotic locomotion scenarios.

Abstract

Developing agents for complex and underspecified tasks, where no clear objective exists, remains challenging but offers many opportunities. This is especially true in video games, where simulated players (bots) need to play realistically, and there is no clear reward to evaluate them. While imitation learning has shown promise in such domains, these methods often fail when agents encounter out-of-distribution scenarios during deployment. Expanding the training dataset is a common solution, but it becomes impractical or costly when relying on human demonstrations. This article addresses active imitation learning, aiming to trigger expert intervention only when necessary, reducing the need for constant expert input along training. We introduce Random Network Distillation DAgger (RND-DAgger), a new active imitation learning method that limits expert querying by using a learned state-based…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The research problem this work focuses on is good. Increasing the efficiency of imitation learning is an open problem and the most effective way to utilize expert data within IL. Improving the interaction between expert and policy increases the ability of IL to be used with real-world problems. The contribution of this work is OK: the authors propose their method and perform a study to verify their claims. The algorithm is clearly defined and could be implemented from the information given. The expe

Weaknesses

The novelty of this work is minimal. As far as I understand it, the methods used in this work are all previously known. DAgger and the OOD classifier seem to be the main components, but both are previous work. The statistical rigor needs improvement. The metrics are only averaged over 8 seeds. Is there a reason for only this many? I feel like it should be many more. I would also like to see confidence intervals on Table 1. There is no failure analysis. I wish there was one on the t

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper presents a well-founded motivation, aiming to optimize the timing of expert interventions to reduce overall costs associated with human expertise and minimize the frequency of transitions between human experts and the learning agent.
- The paper is straightforward, particularly for readers with a background in the domain of unsupervised RL.

Weaknesses

- **Novelty:** The paper's novelty appears constrained, as it predominantly builds upon an established novelty measure within the unsupervised RL domain. This integration approach mirrors Ensemble-DAgger, which relies on a principle similar to Disagreement in the unsupervised RL domain.
- **Limitations:** The limitations specific to this method are not sufficiently addressed. Certain limitations noted by the authors for other approaches may also apply here. For instance, the paper states, “Whil

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The idea is very simple and focuses on a gap that previous approaches do not consider (i.e. OOD states rather than action mismatch).
- The algorithm appears to be practical compared to existing approaches.

Weaknesses

I am happy to increase the scores if these comments are addressed.

**Comments**

- I am unsure why the formulation is a POMDP rather than an MDP. The environments used in this paper appear to use state-based information, so I cannot really be sure that they are partially observable; using images would be far more convincing.
- I further believe that the current method assumes that observation aliasing is not a problem. Self-driving environment can dramatically

Taxonomy

Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Robot Manipulation and Learning