Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration

Heyang Zhao; Xingrui Yu; David M. Bossens; Ivor W. Tsang; Quanquan Gu

arXiv:2506.20307·cs.LG·June 26, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ILDE, a novel imitation learning algorithm that combines optimistic policy optimization and curiosity-driven exploration to learn efficiently from limited demonstrations and surpass expert performance.

Contribution

The paper proposes ILDE, a new imitation learning method that enhances exploration and achieves beyond-expert performance with fewer demonstrations.

Findings

01. ILDE outperforms state-of-the-art algorithms in sample efficiency.

02. ILDE achieves beyond-expert performance on Atari and MuJoCo tasks.

03. Theoretical analysis shows sublinear regret growth for ILDE.
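
For readers unfamiliar with the term in the third finding, the block below states a generic textbook notion of sublinear regret. The comparator policy $\pi^{\dagger}$, the episode count $K$, and the value notation $V^{\pi}$ are assumptions introduced here for illustration; the paper's exact regret definition and rate are not given in this summary and may differ.

```latex
% Generic regret over K episodes against a comparator policy \pi^\dagger
% (notation assumed for illustration, not taken from the paper).
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V^{\pi^{\dagger}} - V^{\pi_k} \Bigr)

% "Sublinear" means the regret grows more slowly than K,
% so the average per-episode gap vanishes:
\mathrm{Regret}(K) = o(K)
\quad\Longleftrightarrow\quad
\lim_{K \to \infty} \frac{\mathrm{Regret}(K)}{K} = 0
```

As an illustration only, a bound of order $\sqrt{K}$ would be sublinear, since the average gap would then shrink like $1/\sqrt{K}$.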

Abstract

Imitation learning is a central problem in reinforcement learning where the goal is to learn a policy that mimics the expert's behavior. In practice, it is often challenging to learn the expert policy from a limited number of demonstrations accurately due to the complexity of the state space. Moreover, it is essential to explore the environment and collect data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm called Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty to potentially improve the convergence to the expert policy, and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories to potentially yield beyond-expert…
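
The two exploration signals described in the abstract can be pictured as extra terms added to an imitation reward. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the count-based bonus stands in for "high uncertainty", the nearest-demonstration distance stands in for "deviation from the demonstration trajectories", and the function names and the weights `beta` and `lam` are all hypothetical.

```python
import numpy as np


def count_bonus(counts, state, action, scale=1.0):
    """Assumed stand-in for an uncertainty bonus: larger for rarely
    visited (state, action) pairs, shrinking as visits accumulate."""
    n = counts.get((state, action), 0)
    return scale / np.sqrt(n + 1.0)


def curiosity_reward(state_vec, demo_states):
    """Assumed stand-in for curiosity: Euclidean distance from the
    current state to the nearest demonstration state."""
    dists = np.linalg.norm(demo_states - state_vec, axis=1)
    return float(np.min(dists))


def shaped_reward(imitation_r, bonus, curiosity, beta=0.1, lam=0.1):
    """Combine an imitation reward with both exploration terms.
    beta and lam are hypothetical weights, not values from the paper."""
    return imitation_r + beta * bonus + lam * curiosity


# Usage with toy data (hashable discrete states for the counts,
# vector states for the curiosity term).
counts = {(0, 1): 3}
demo_states = np.array([[0.0, 0.0], [1.0, 0.0]])
bonus = count_bonus(counts, 0, 1)
curiosity = curiosity_reward(np.array([0.0, 3.0]), demo_states)
total = shaped_reward(1.0, bonus, curiosity)
```

Keeping the two terms separate makes it straightforward to switch either one off, which mirrors the kind of ablation the reviewers discuss below.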

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 2

Strengths

- This paper studies a seemingly interesting problem: using imitation learning to learn a policy that performs much better than the demonstrations.
- The paper offers some theoretical analysis of the proposed method.

Weaknesses

- The problem formulation, which optimizes the GAIL loss plus an intrinsic reward to achieve better performance than the expert, is not convincing.
- The proposed method is evaluated on only a limited set of tasks, and the results lack explanation.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

I am not the best theoretician by any means, but this paper is very theoretically sound. I think the regret analysis was done well, which is the main theoretical contribution of this work. Experimental results were quite expansive and good. The authors seemed to ablate the right variables (e.g. what happens if we turn our state-action based exploration bonus off, what happens if we turn our imitation reward off) to analyze exactly what is different about their method compared to a standard base

Weaknesses

It would be interesting to see how just the state-action exploration bonus does as well (e.g. the state entropy bonus stays, but the GIRIL-like intrinsic reward is not used). To me, both of these methods seem to be doing the same thing (one on the state-action space as a whole, and one focusing on the expert trajectories), so it would be good to see an experiment that keeps the state entropy bonus, doesn't use the demonstration-based intrinsic reward, and keeps the standard imitation learning re

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. This paper proposes a novel ILDE framework, which combines optimistic exploration with curiosity-driven exploration for imitation learning.

Weaknesses

1. **Theoretical Limitations**
   1. First, the assumptions made in this paper are overly strong. In Assumption 4.2, the authors directly assume that the statistical error resulting from using a finite set of expert demonstrations is small. This assumption is both strong and unreasonable, as a primary goal in imitation learning theory is to examine the relationship between the number of expert trajectories and the error bound [1]. Additionally, Assumption 4.1 presumes that the loss function is

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Multimodal Machine Learning Applications