Learning Reward Machines: A Study in Partially Observable Reinforcement Learning
Rodrigo Toro Icarte, Ethan Waldie, Toryn Q. Klassen, Richard Valenzano, Margarita P. Castro, Sheila A. McIlraith

TL;DR
This paper introduces a method for learning reward machines from experience in reinforcement learning, enabling better problem decomposition and improved performance in partially observable environments.
Contribution
It presents an approach for learning reward machines automatically from experience, so that the structured problem decomposition they provide no longer depends on a user-specified machine.
Findings
Outperforms A3C, PPO, and ACER in three partially observable domains
Effectively solves partially observable RL problems
Demonstrates advantages and limitations of learned reward machines
Abstract
Reinforcement learning (RL) is a central problem in artificial intelligence. This problem consists of defining artificial agents that can learn optimal behaviour by interacting with an environment -- where the optimal behaviour is defined with respect to a reward signal that the agent seeks to maximize. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of…
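To make the "automata-based representation of a reward function" concrete, here is a minimal sketch of a reward machine as a finite-state machine whose transitions fire on high-level events and emit rewards. It is an illustration only, not the paper's implementation: the `RewardMachine` class, the `run` helper, and the `coffee`/`office` events are hypothetical names chosen for this example, and the learning procedure described in the abstract is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Hypothetical minimal reward machine (illustrative, not the authors' code)."""
    initial_state: int
    # Maps (machine state, event) -> (next machine state, reward).
    transitions: dict = field(default_factory=dict)
    terminal_states: frozenset = frozenset()

    def step(self, state, events):
        """Advance on the events detected this step.

        If no transition matches, stay in the same state with reward 0.
        """
        for event in sorted(events):
            if (state, event) in self.transitions:
                return self.transitions[(state, event)]
        return state, 0.0


def run(rm, trace):
    """Feed a sequence of event sets through the machine and sum the rewards."""
    state, total = rm.initial_state, 0.0
    for events in trace:
        if state in rm.terminal_states:
            break
        state, reward = rm.step(state, events)
        total += reward
    return state, total


# Assumed toy task: pick up coffee, then reach the office.
rm = RewardMachine(
    initial_state=0,
    transitions={
        (0, "coffee"): (1, 0.0),  # coffee picked up, no reward yet
        (1, "office"): (2, 1.0),  # delivered: reward 1, task done
    },
    terminal_states=frozenset({2}),
)

trace_ok = [set(), {"coffee"}, set(), {"office"}]
trace_bad = [{"office"}, set()]  # office reached without coffee

result_ok = run(rm, trace_ok)
result_bad = run(rm, trace_bad)
```

In this sketch the machine state acts as a small memory: reaching the office pays off only after coffee has been seen, which a policy conditioned on the current observation alone could not tell apart. Each machine state also marks out a subproblem ("get coffee", then "reach the office"), which is the kind of decomposition the abstract refers to.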
Taxonomy
Methods: Trust Region Policy Optimization · Retrace · Convolution · Softmax · Experience Replay · Entropy Regularization · Proximal Policy Optimization · Stochastic Dueling Network · Dense Connections
