Online Apprenticeship Learning
Lior Shani, Tom Zahavy, Shie Mannor

TL;DR
This paper introduces an online apprenticeship learning algorithm that efficiently learns policies from expert trajectories without solving an MDP at each step, achieving low regret and good performance in high-dimensional control tasks.
Contribution
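We propose an online apprenticeship learning method that combines two mirror descent based no-regret algorithms, one for policy optimization and one for learning the worst-case cost. The method avoids solving an MDP at each iteration, and we demonstrate its effectiveness with a deep variant similar to GAIL.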
Findings
Achieves $O(\sqrt{K})$ regret with optimistic exploration, where $K$ is the number of interactions with the MDP, plus an additional linear error term.
Avoids solving MDPs at each iteration, improving practicality.
Performs well in high-dimensional control environments.
Abstract
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms: one for policy optimization and another for learning the worst case cost. By employing optimistic exploration, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, and an additional linear error term that depends on the amount of…
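To make the two-player structure in the abstract concrete, here is a minimal tabular sketch in Python. It is not the authors' algorithm: it assumes known transition dynamics (so there is no optimistic exploration bonus), an exactly known expert occupancy measure (rather than one estimated from sampled trajectories), and a cost class of all tables in $[0,1]^{S \times A}$. The names `evaluate`, `oal_sketch`, `eta_pi`, and `eta_c` are hypothetical. The policy player takes a KL mirror descent (exponentiated-gradient) step against the current cost, and the cost player takes a projected gradient ascent step toward costs that separate the agent from the expert.

```python
import numpy as np

def evaluate(P, cost, pi, mu0):
    """Q-values and state-action occupancy of pi in a finite-horizon MDP.
    P: (S, A, S) transitions, cost: (S, A), pi: (H, S, A), mu0: (S,)."""
    H, S, A = pi.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):           # backward induction
        Q[h] = cost + P @ V_next           # (S, A, S) @ (S,) -> (S, A)
        V_next = (pi[h] * Q[h]).sum(axis=1)
    d = np.zeros((S, A))
    mu = mu0.copy()
    for h in range(H):                     # forward pass for occupancy
        d_h = mu[:, None] * pi[h]
        d += d_h
        mu = np.einsum("sa,sat->t", d_h, P)
    return Q, d                            # d sums to H

def oal_sketch(P, mu0, d_expert, H, K, eta_pi=0.5, eta_c=0.5):
    """Alternate one mirror descent step per player for K rounds.
    Assumption: dynamics P and expert occupancy d_expert are known exactly."""
    S, A, _ = P.shape
    log_pi = np.full((H, S, A), -np.log(A))    # uniform initial policy
    cost = np.full((S, A), 0.5)
    gaps = []
    for _ in range(K):
        pi = np.exp(log_pi)
        Q, d_pi = evaluate(P, cost, pi, mu0)
        # Worst-case gap over costs in [0, 1]^{S x A}: sum of positive parts.
        gaps.append(np.maximum(d_pi - d_expert, 0.0).sum())
        # Policy player: KL mirror descent, pi <- pi * exp(-eta * Q), renormalized.
        log_pi = log_pi - eta_pi * Q
        m = log_pi.max(axis=-1, keepdims=True)
        log_pi -= m + np.log(np.exp(log_pi - m).sum(axis=-1, keepdims=True))
        # Cost player: projected gradient ascent on <cost, d_pi - d_expert>.
        cost = np.clip(cost + eta_c * (d_pi - d_expert), 0.0, 1.0)
    return np.exp(log_pi), cost, gaps

# Toy usage on a random MDP with a hypothetical deterministic expert.
rng = np.random.default_rng(0)
S, A, H, K = 4, 3, 5, 200
P = rng.dirichlet(np.ones(S), size=(S, A))     # shape (S, A, S)
mu0 = np.full(S, 1.0 / S)
expert = np.zeros((H, S, A))
expert[:, np.arange(S), rng.integers(A, size=S)] = 1.0
_, d_expert = evaluate(P, np.zeros((S, A)), expert, mu0)
pi, cost, gaps = oal_sketch(P, mu0, d_expert, H, K)
```

Note that neither player solves an MDP to optimality inside the loop; each round costs one policy evaluation, which is the practical point the abstract emphasizes. The list `gaps` tracks the per-round worst-case performance gap, which is the quantity a regret bound would sum over rounds.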
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Optimization and Search Problems
Methods: Generative Adversarial Imitation Learning
