Policy Improvement via Imitation of Multiple Oracles
Ching-An Cheng, Andrey Kolobov, Alekh Agarwal

TL;DR
This paper introduces MAMBA, a novel imitation learning algorithm that effectively leverages multiple suboptimal oracles by using a state-wise maximum value baseline, enabling faster and more robust policy learning.
Contribution
It proposes a new baseline for multi-oracle imitation learning and develops MAMBA, a policy optimization algorithm with theoretical guarantees to outperform individual oracles.
Findings
MAMBA outperforms existing IL methods in speed and policy quality.
The state-wise maximum oracle value serves as an effective benchmark.
MAMBA can leverage multiple weak oracles to improve learning efficiency.
Abstract
Despite its promise, reinforcement learning's real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies' values as a natural baseline to resolve conflicting advice from multiple oracles.…
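The baseline named in the abstract, the state-wise maximum of the oracle policies' values, can be illustrated with a small sketch. This is not the authors' implementation and does not show MAMBA itself. It assumes a tabular setting in which each oracle's value has already been estimated for every state. The function name `max_oracle_baseline` and the numbers are hypothetical, chosen only for illustration.

```python
import numpy as np


def max_oracle_baseline(oracle_values):
    """Return the state-wise maximum over oracle values.

    oracle_values: array-like of shape (K, S), where entry [k, s] is an
    (assumed, pre-computed) estimate of oracle k's value in state s.
    Returns the baseline of shape (S,) and, per state, the index of the
    oracle that attains it (ties go to the lowest index).
    """
    values = np.asarray(oracle_values, dtype=float)
    baseline = values.max(axis=0)        # best oracle value in each state
    best_oracle = values.argmax(axis=0)  # which oracle is best in each state
    return baseline, best_oracle


# Hypothetical example: two oracles, three states, made-up values.
oracle_values = [
    [1.0, 0.2, 0.5],  # oracle 0
    [0.4, 0.9, 0.5],  # oracle 1
]
baseline, best_oracle = max_oracle_baseline(oracle_values)
```

In this toy example neither oracle is best everywhere: oracle 0 is stronger in the first state and oracle 1 in the second, so the state-wise maximum is at least as large as either oracle's value in every state. That is why it is a more demanding benchmark than any single oracle.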
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
