Policy Gradient Bayesian Robust Optimization for Imitation Learning

Zaynah Javed; Daniel S. Brown; Satvik Sharma; Jerry Zhu; Ashwin Balakrishna; Marek Petrik; Anca D. Dragan; Ken Goldberg

arXiv:2106.06499 · cs.LG · June 23, 2021 · 1 citation

TL;DR

This paper introduces PG-BROIL, a policy gradient method for imitation learning that handles reward ambiguity under a user-specified risk preference. It outperforms state-of-the-art imitation learning algorithms when the demonstrations leave the true reward ambiguous.

Contribution

The paper presents PG-BROIL, which the authors describe as the first policy optimization algorithm that is robust to a distribution over reward hypotheses and scales to continuous MDPs, enabling risk-sensitive behavior.

Findings

01

PG-BROIL effectively balances expected performance and risk.

02

It outperforms state-of-the-art imitation learning methods when learning from ambiguous demonstrations.

03

The approach produces a spectrum of behaviors from risk-neutral to risk-averse.

Abstract

The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to…
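To make the "soft-robust objective" concrete, the sketch below scores a fixed policy. It assumes the BROIL-style form λ·E[return] + (1 − λ)·CVaR_α[return], where the expectation and the CVaR are taken over a discrete posterior of reward hypotheses. The function names `cvar` and `soft_robust_objective` and the example numbers are hypothetical and only illustrate the trade-off. The sketch does not cover the policy-gradient update that PG-BROIL uses to optimize this objective.

```python
import numpy as np


def cvar(returns, probs, alpha):
    """Lower-tail CVaR_alpha of a discrete distribution of policy returns.

    Uses the Rockafellar-Uryasev form
        CVaR_alpha = max_sigma [ sigma - E[(sigma - X)_+] / alpha ].
    For a discrete distribution this concave, piecewise-linear function of
    sigma attains its maximum at one of the support points, so we only
    evaluate those points.
    """
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    candidates = [
        s - np.dot(probs, np.maximum(s - returns, 0.0)) / alpha
        for s in returns
    ]
    return float(max(candidates))


def soft_robust_objective(returns, probs, lam, alpha):
    """Hypothetical helper: lam * expected return + (1 - lam) * CVaR_alpha.

    returns[i] is the policy's return under reward hypothesis i and
    probs[i] is that hypothesis's posterior probability.
    lam = 1 is risk-neutral; lam = 0 is fully risk-averse.
    """
    expected = float(np.dot(probs, returns))
    return lam * expected + (1.0 - lam) * cvar(returns, probs, alpha)


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.2]        # made-up posterior over 3 reward hypotheses
    policy_a = [10.0, 4.0, -2.0]   # higher mean, bad worst case
    policy_b = [6.0, 5.0, 4.0]     # lower mean, safer tail
    for lam in (1.0, 0.5, 0.0):
        a = soft_robust_objective(policy_a, probs, lam, alpha=0.2)
        b = soft_robust_objective(policy_b, probs, lam, alpha=0.2)
        print(f"lam={lam}: A={a:.2f}, B={b:.2f}")
```

With these made-up numbers, the risk-neutral setting (λ = 1) prefers policy A. The fully risk-averse setting (λ = 0) prefers policy B. Sweeping λ between these extremes is one way to picture the range of behaviors the abstract describes, from risk-neutral to risk-averse.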

Videos

Policy Gradient Bayesian Robust Optimization for Imitation Learning (SlidesLive)

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)