Policy Improvement via Imitation of Multiple Oracles

Ching-An Cheng; Andrey Kolobov; Alekh Agarwal

arXiv:2007.00795 · cs.LG · December 8, 2020 · 6 cites

TL;DR

This paper introduces MAMBA, a novel imitation learning algorithm that effectively leverages multiple suboptimal oracles by using a state-wise maximum value baseline, enabling faster and more robust policy learning.

Contribution

It proposes a new baseline for multi-oracle imitation learning and develops MAMBA, a policy optimization algorithm with theoretical guarantees to outperform individual oracles.

Findings

1. MAMBA outperforms existing IL methods in speed and policy quality.
2. The state-wise maximum oracle value serves as an effective benchmark.
3. MAMBA can leverage multiple weak oracles to improve learning efficiency.

Abstract

Despite its promise, reinforcement learning's real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies' values as a natural baseline to resolve conflicting advice from multiple oracles.…
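To make the proposed baseline concrete, the following is a minimal sketch of the state-wise maximum over oracle values in a small tabular setting. It illustrates only the baseline named in the abstract, not the MAMBA algorithm itself. The function name `max_oracle_baseline`, the array layout, and the numbers are hypothetical; the sketch also assumes that each oracle's value at each state is already known and that higher values are better.

```python
import numpy as np


def max_oracle_baseline(oracle_values):
    """Return the state-wise maximum of the oracle values.

    oracle_values: array-like of shape (K, S), where entry [k, s] is the
    (assumed known) value of oracle k's policy at state s.
    """
    oracle_values = np.asarray(oracle_values, dtype=float)
    # Baseline at each state: the best value any oracle achieves there.
    baseline = oracle_values.max(axis=0)
    # Index of an oracle attaining that maximum (first one wins on ties).
    best_oracle = oracle_values.argmax(axis=0)
    return baseline, best_oracle


# Hypothetical values for 2 oracles over 3 states (illustrative only).
oracle_values = np.array([
    [1.0, 0.2, 0.5],  # oracle 0 is strongest in state 0
    [0.3, 0.9, 0.5],  # oracle 1 is strongest in state 1
])
baseline, best_oracle = max_oracle_baseline(oracle_values)
```

In this toy example the oracles disagree about which is better in states 0 and 1, and the baseline takes the higher value in each state, so it is at least as large as every individual oracle's value everywhere. In practice the oracle values would have to be estimated rather than read from a table.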

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings