Discovering a set of policies for the worst case reward

Tom Zahavy; Andre Barreto; Daniel J Mankowitz; Shaobo Hou; Brendan O'Donoghue; Iurii Kemaev; Satinder Singh

arXiv:2102.04323 · cs.AI · December 13, 2021

TL;DR

This paper introduces a policy iteration algorithm to construct a set of policies that maximizes the worst-case performance across multiple reinforcement learning tasks with linear reward functions, ensuring diverse and robust policy sets.

Contribution

The paper proposes a novel policy iteration method for building diverse policy sets that optimize worst-case performance in multi-task reinforcement learning scenarios.

Findings

1. The algorithm guarantees monotonically improving worst-case performance.

2. Empirical results confirm performance improvements on grid world and DeepMind control tasks.

3. The resulting policy sets are diverse, enabling different behaviors and skills.

Abstract

We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm…
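As a rough illustration of the quantities the abstract describes, the sketch below evaluates a set-max policy on linear-reward tasks and finds the task on which it does worst. It rests on several assumptions that are not stated in the text above: each policy is summarized by its expected discounted feature vector from a fixed start state (so its value on a task with weights w is a dot product), the set-max policy is read as "on each task, do as well as the best policy in the set", and the task set is a small finite list. The names `policy_features`, `tasks`, `smp_value`, and `worst_case_task` are hypothetical, and all numbers are made up for the example.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def smp_value(policy_features: Sequence[Vector], task: Vector) -> float:
    """Value of the set-max policy on one task.

    Assumption: a policy's value on a task is the dot product of its
    expected discounted feature vector with the task's reward weights,
    and the set-max policy matches the best policy in the set.
    """
    return max(dot(psi, task) for psi in policy_features)


def worst_case_task(
    policy_features: Sequence[Vector], tasks: Sequence[Vector]
) -> Tuple[Vector, float]:
    """Return the candidate task with the lowest set-max value, and that value."""
    scored = [(w, smp_value(policy_features, w)) for w in tasks]
    return min(scored, key=lambda pair: pair[1])


# Hypothetical feature vectors for two policies (illustrative numbers only).
policy_features = [(1.0, 0.0), (0.0, 1.0)]
# Hypothetical finite list of candidate tasks (reward weight vectors).
tasks = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.5, 0.5)]

worst_task, worst_value = worst_case_task(policy_features, tasks)

# Adding a policy that does well on the current worst-case task
# raises the worst-case value of the set in this toy example.
bigger_set = policy_features + [(-1.0, 0.0)]
new_worst_task, new_worst_value = worst_case_task(bigger_set, tasks)
```

In this toy setup the worst-case value rises from 0.0 to 0.5 once the third policy is added, which mirrors the idea of growing the set to improve worst-case performance. How the new policy is actually found is not covered by this sketch.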

Videos

Discovering a set of policies for the worst case reward · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Adversarial Robustness in Machine Learning