Robust Asymmetric Learning in POMDPs
Andrew Warrington, J. Wilder Lavington, Adam Ścibior, Mark Schmidt, Frank Wood

TL;DR
This paper introduces adaptive asymmetric DAgger (A2D), a method for training policies in POMDPs. It addresses a flaw in existing imitation approaches, where the expert does not account for what the agent cannot observe, by jointly training the expert and the agent so that the expert maximizes the agent's expected reward under partial observability.
Contribution
The paper proposes a novel objective and algorithm (A2D) that trains an expert to maximize the agent’s expected reward, improving imitation safety and performance in POMDPs.
Findings
A2D produces expert policies that are safe for imitation.
A2D outperforms fixed expert imitation in POMDPs.
The method effectively handles partial observability in policy learning.
Abstract
Policies for partially observed Markov decision processes can be efficiently learned by imitating policies for the corresponding fully observed Markov decision processes. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and so may encourage actions that are sub-optimal, even unsafe, under partial information. We derive an objective to instead train the expert to maximize the expected reward of the imitating agent policy, and use it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and the agent. We show that A2D produces an expert policy that the agent can safely imitate, in turn outperforming policies learned by imitating a fixed expert.
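The failure mode described in the abstract can be illustrated with a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical one-step task with a hidden binary state, an uninformative observation, and an imitating agent that recovers the expert's action distribution averaged over the hidden state. The reward table, the function names, and the exhaustive search over deterministic experts (a stand-in for whatever optimization A2D actually uses) are all illustrative assumptions.

```python
import itertools
import numpy as np

# Hypothetical one-step problem: hidden state s in {0, 1} with a uniform
# prior, and an observation that carries no information about s.
state_prior = np.array([0.5, 0.5])
# reward[s, a] for actions 0 = left, 1 = right, 2 = safe (made-up values)
reward = np.array([[1.0, -10.0, 0.5],
                   [-10.0, 1.0, 0.5]])

def implicit_agent_policy(expert):
    """Average the state-conditioned expert over the agent's belief."""
    return state_prior @ expert  # shape (n_actions,)

def agent_return(agent_policy):
    """Expected reward when the agent acts without seeing the state."""
    return float(state_prior @ reward @ agent_policy)

def one_hot_expert(actions):
    """Deterministic expert: actions[s] is the action taken in state s."""
    expert = np.zeros_like(reward)
    expert[np.arange(len(actions)), list(actions)] = 1.0
    return expert

# Fixed expert: optimal for the fully observed problem.
mdp_expert = one_hot_expert(reward.argmax(axis=1))
fixed_return = agent_return(implicit_agent_policy(mdp_expert))

# Adapted expert: chosen to maximize the imitating agent's return.
# Exhaustive search is an assumption made to keep the sketch short.
candidates = [one_hot_expert(a) for a in itertools.product(range(3), repeat=2)]
adapted_expert = max(candidates,
                     key=lambda e: agent_return(implicit_agent_policy(e)))
adapted_return = agent_return(implicit_agent_policy(adapted_expert))

print(fixed_return, adapted_return)
```

In this toy setting, imitating the fully observed expert makes the agent split its probability between the two risky actions, while the adapted expert steers the agent toward the safe action it can actually justify from its own observations.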
Taxonomy
Topics: Machine Learning and ELM · Machine Learning and Algorithms · Neural Networks and Applications
