Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Xuefeng Liu; Takuma Yoneda; Rick L. Stevens; Matthew R. Walter; Yuxin Chen

arXiv:2310.01737·cs.LG·August 12, 2025·2 cites

Open Access 1 Video 3 Reviews

TL;DR

This paper introduces RPI, a novel algorithm that combines imitation and reinforcement learning to improve sample efficiency and robustness in policy learning, especially in sparse-reward environments.

Contribution

It presents a new method that interleaves IL and RL, using online performance estimates to adaptively switch between them, outperforming existing approaches.

Findings

01

RPI achieves superior performance on benchmark tasks.

02

Theoretical analysis confirms its robustness and efficiency.

03

Empirical results show effective learning from diverse oracles.

Abstract

While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address this, the paper introduces Robust Policy Improvement (RPI), an algorithm that actively interleaves IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and…
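The switching idea in the abstract (follow an oracle while the learner is weak, then hand control to the learner once its estimated performance overtakes the oracles) can be illustrated with a small sketch. This is not the paper's RAPS rule, whose details are cut off above; it is a hypothetical selection rule that assumes each policy's value is tracked as a running mean of observed returns, and that the learner is trusted only when a pessimistic bound on its value exceeds the best oracle's mean. The names `ValueEstimate`, `select_policy`, and the width parameter `z` are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class ValueEstimate:
    """Running mean and variance of observed returns (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, ret: float) -> None:
        self.n += 1
        delta = ret - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (ret - self.mean)

    def std_err(self) -> float:
        # With fewer than two samples the uncertainty is treated as unbounded.
        if self.n < 2:
            return math.inf
        sample_var = self.m2 / (self.n - 1)
        return math.sqrt(sample_var / self.n)


def select_policy(estimates: Dict[str, ValueEstimate],
                  learner: str = "learner",
                  z: float = 1.0) -> str:
    """Return the name of the policy to follow next.

    Assumption (not from the paper): the learner is chosen only if its
    lower confidence bound beats the best oracle's mean; otherwise the
    best oracle is imitated. Requires at least one oracle in `estimates`.
    """
    oracles = [name for name in estimates if name != learner]
    best_oracle = max(oracles, key=lambda name: estimates[name].mean)
    learner_est = estimates[learner]
    learner_lcb = learner_est.mean - z * learner_est.std_err()
    if learner_lcb > estimates[best_oracle].mean:
        return learner      # RL phase: the learner acts as the improved oracle
    return best_oracle      # IL phase: query the strongest oracle
```

Early in training the learner has few or poor returns, so the rule returns an oracle; once the learner's returns are consistently higher, the rule returns the learner. The pessimistic bound is one simple way to avoid switching on a noisy estimate.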

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The paper is clearly written and easy to understand, and the method is natural.
2. The theoretical analysis is adequate.

Weaknesses

(1) Novelty.
- The policy improvement of perfect knowledge is similar to making ensembles of several imitated policies. The theoretical analysis of the method seems a little redundant.
- The exploration in RPI is uniformly random sampling (line 3 of Alg. 1). These seem trivial.
(2) The experimental setting is neither clear nor sufficient.
- How many demonstrations for imitation learning? How many online interactions for reinforcement learning? The x-axis of the curves is training step, what is 100

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

Motivation
* The work is well-motivated as an approach to learn from multiple oracles and improve upon them. Building frameworks that can blend between existing knowledge via oracles and exploration using a learner's policy seems like a good way to enhance the setting of only learning from oracles.
Structural clarity
* The paper is well structured and well written. The flow is very clear. I do have to say that I did get lost in the notation details at times because there is a lot of notation tha

Weaknesses

Contextualization with prior work
Before I start, I would like to mention that I am not familiar with this sub-field of RL but do know the standard online and offline RL literature quite well.
* The prior work section is rather brief with a total of 6 citations. I am not familiar with the exact sub-field but this seems rather little given the literature is several years old.
* My first main concern is related to the contextualization of various parts of the paper to previous work. I think it c

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 2

Strengths

- The paper is clear and well written. The authors have done a good job of presenting an overview of their method, giving theoretical justification of why the method works, and benchmarking against several state-of-the-art baselines.
- The experiments are detailed, and I appreciate the ablation studies which show the usefulness of the robust policy selection rule vs naive rule, and the transition from utilizing the expert policies early in training while transitioning smoothly to RL as the polic

Weaknesses

- For reproducibility purposes, it would be great if the authors could include a table of algorithm hyperparameters used in the appendix, and hyperparameter sweeping strategies.

Videos

Blending Imitation and Reinforcement Learning for Robust Policy Improvement· slideslive

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Reinforcement Learning in Robotics