Reward is enough for convex MDPs

Tom Zahavy; Brendan O'Donoghue; Guillaume Desjardins; Satinder Singh

arXiv:2106.00661 · cs.AI · June 5, 2023 · 1 cites

TL;DR

This paper shows that goals in convex MDPs cannot be captured by a stationary reward function, and introduces a framework based on Fenchel duality that recasts the problem as a min-max game between a policy player and a cost (negative reward) player.

Contribution

It generalizes the standard RL formulation to convex MDPs, shows that stationary reward functions cannot express their goals, and proposes a unifying meta-algorithm based on Fenchel duality.

Findings

01

Convex MDPs cannot be formulated with stationary reward functions.

02

A min-max game reformulation unifies existing algorithms.

03

The approach covers a broader class of RL problems than the standard formulation, including apprenticeship learning, constrained MDPs, and pure exploration.

Abstract

Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called 'pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) 'players', using Fenchel duality. We propose a meta-algorithm for…
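The min-max game described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three-state deterministic toy MDP, the use of a discounted state-action occupancy d in place of the stationary distribution, the entropy objective f(d) = sum d log d (a 'pure exploration' goal), and the helper names step, best_response, occupancy, entropy, and solve. By Fenchel duality, min_d f(d) = min_d max_λ (λ·d - f*(λ)). In this sketch the cost player follows the leader by setting λ to the gradient of f at the running average occupancy, and the policy player best-responds by solving an ordinary MDP with cost λ. The cost changes every round, so the reward the policy sees is non-stationary.

```python
import math

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
EPS = 1e-8  # smoothing so that log(0) never occurs


def step(s, a):
    """Toy deterministic dynamics (assumed): action 0 stays, action 1 moves on."""
    return s if a == 0 else (s + 1) % N_STATES


def best_response(cost, sweeps=300):
    """Policy player: value iteration for the policy minimising discounted cost."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [min(cost[s][a] + GAMMA * v[step(s, a)] for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return [min(range(N_ACTIONS), key=lambda a: cost[s][a] + GAMMA * v[step(s, a)])
            for s in range(N_STATES)]


def occupancy(policy, horizon=300):
    """Normalised discounted state-action occupancy, starting from state 0."""
    d = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s, weight = 0, 1.0 - GAMMA
    for _ in range(horizon):
        a = policy[s]
        d[s][a] += weight
        s, weight = step(s, a), weight * GAMMA
    total = sum(map(sum, d))
    return [[x / total for x in row] for row in d]


def entropy(d):
    """Shannon entropy of an occupancy measure (zero entries contribute nothing)."""
    return -sum(x * math.log(x) for row in d for x in row if x > 0)


def solve(iterations=200):
    """Alternate cost-player and policy-player moves; return the average occupancy."""
    d_bar = occupancy([0] * N_STATES)  # start from the "always stay" policy
    for k in range(1, iterations + 1):
        # Cost player: gradient of f(d) = sum d log d at the current average.
        cost = [[math.log(x + EPS) + 1.0 for x in row] for row in d_bar]
        # Policy player: best response to this round's cost.
        d_k = occupancy(best_response(cost))
        # Running average of occupancies (a Frank-Wolfe step of size 1/(k+1)).
        d_bar = [[x + (y - x) / (k + 1) for x, y in zip(rx, ry)]
                 for rx, ry in zip(d_bar, d_k)]
    return d_bar


if __name__ == "__main__":
    d = solve()
    print("entropy of averaged occupancy:", entropy(d))
```

Each individual best response is a deterministic policy whose occupancy is concentrated on a few state-action pairs; only the average over rounds spreads mass widely. That is one way to see why a single stationary reward does not suffice for this kind of goal in the sketch, while a sequence of changing costs does.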

Videos

Reward is enough for convex MDPs · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Electric Vehicles and Infrastructure