Learning Reward Machines: A Study in Partially Observable Reinforcement Learning

Rodrigo Toro Icarte; Ethan Waldie; Toryn Q. Klassen; Richard Valenzano; Margarita P. Castro; Sheila A. McIlraith

arXiv:2112.09477·cs.LG·December 20, 2021·1 cites

Open Access

TL;DR

This paper introduces a method for learning reward machines from experience in reinforcement learning, enabling better problem decomposition and improved performance in partially observable environments.

Contribution

It presents a novel approach to learn reward machines automatically, enhancing structured problem decomposition in RL without prior specification.

Findings

01 Outperforms A3C, PPO, and ACER in three domains

02 Effectively solves partially observable RL problems

03 Demonstrates advantages and limitations of learned reward machines

Abstract

Reinforcement learning (RL) is a central problem in artificial intelligence. This problem consists of defining artificial agents that can learn optimal behaviour by interacting with an environment -- where the optimal behaviour is defined with respect to a reward signal that the agent seeks to maximize. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of…
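To make the "automata-based representation of a reward function" concrete, here is a minimal sketch of a hand-written reward machine in Python. It illustrates only the data structure, not the paper's learning procedure. The class name, the event labels ("key", "door"), and the two-step task are hypothetical and chosen purely for illustration. The sketch assumes that high-level events can be detected from observations and that any unlisted event leaves the machine in its current state with zero reward.

```python
class RewardMachine:
    """Finite automaton whose transitions emit rewards (illustrative sketch)."""

    def __init__(self, initial_state, transitions):
        # transitions maps (rm_state, event) -> (next_rm_state, reward).
        self.initial_state = initial_state
        self.transitions = dict(transitions)
        self.state = initial_state

    def reset(self):
        """Return the machine to its initial state."""
        self.state = self.initial_state
        return self.state

    def step(self, event):
        """Advance on one detected event and return (next_state, reward)."""
        # Assumption: unlisted (state, event) pairs self-loop with zero reward.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return next_state, reward


def build_key_door_rm():
    """Hypothetical task: pick up a key (state 0 -> 1), then open a door (1 -> 2)."""
    return RewardMachine(
        initial_state=0,
        transitions={
            (0, "key"): (1, 0.0),
            (1, "door"): (2, 1.0),
        },
    )


def run_events(rm, events):
    """Feed a sequence of events to the machine; return (final_state, total_reward)."""
    rm.reset()
    total = 0.0
    for event in events:
        _, reward = rm.step(event)
        total += reward
    return rm.state, total
```

In this toy machine, reaching the door before the key earns nothing, because the state records which events have already occurred. That state can serve as a small memory for an agent whose current observation alone does not reveal its progress, which is one way to read the abstract's claim about partially observable problems.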

Taxonomy

Methods: Trust Region Policy Optimization · Retrace · Convolution · Softmax · Experience Replay · Entropy Regularization · Proximal Policy Optimization · Stochastic Dueling Network · Dense Connections