Reinforcement Learning for Task Specifications with Action-Constraints
Arun Raman, Keerthan Shagrithaya, Shalabh Bhatnagar

TL;DR
This paper introduces an automata-based reinforcement learning method that incorporates safety constraints on action sequences into policy learning for Markov Decision Processes, ensuring safe behavior during learning.
Contribution
It combines supervisory control theory with Q-learning by using automata to enforce non-Markovian safety constraints, advancing safe reinforcement learning techniques.
Findings
Automata-based constraints effectively enforce safety in RL.
The method learns optimal policies under non-Markovian action-sequence and state constraints.
Simulation results demonstrate improved safety and performance.
Abstract
In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences deemed unsafe and/or safe is given in terms of a finite-state automaton, and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state…
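The idea in the abstract, a supervisor that tracks an automaton over past actions and disables actions that would complete an unsafe sequence, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy chain MDP, the constraint "never take action b twice in a row", and the names `DFA_DELTA`, `enabled_actions`, `step`, and `q_learning`. The sketch also assumes every automaton state keeps at least one enabled action, and it omits the reward-machine handling of state constraints. Q-values are indexed by the pair (MDP state, automaton state), which is what makes the non-Markovian action constraint Markovian for the learner.

```python
import random
from collections import defaultdict

ACTIONS = ["a", "b"]
UNSAFE = "unsafe"
N = 4  # hypothetical chain MDP with states 0..N-1; state N-1 is the goal

# Hypothetical constraint automaton over action sequences:
# state 0 = last action was not "b", state 1 = last action was "b".
# Taking "b" in state 1 would complete the unsafe sequence "bb".
DFA_DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 0, (1, "b"): UNSAFE,
}

def enabled_actions(q):
    """Supervisor: keep only actions that do not lead the automaton to UNSAFE."""
    return [a for a in ACTIONS if DFA_DELTA[(q, a)] != UNSAFE]

def step(s, a):
    """Toy deterministic dynamics: "b" moves right, "a" stays in place."""
    s2 = min(s + 1, N - 1) if a == "b" else s
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # keyed by (mdp_state, automaton_state, action)
    traces = []
    for _ in range(episodes):
        s, q, trace = 0, 0, []
        for _ in range(100):
            allowed = enabled_actions(q)
            # Epsilon-greedy choice restricted to supervisor-enabled actions.
            if rng.random() < eps:
                a = rng.choice(allowed)
            else:
                a = max(allowed, key=lambda act: Q[(s, q, act)])
            s2, r, done = step(s, a)
            q2 = DFA_DELTA[(q, a)]
            # The bootstrap target also maximizes over enabled actions only.
            best_next = 0.0 if done else max(
                Q[(s2, q2, b)] for b in enabled_actions(q2)
            )
            Q[(s, q, a)] += alpha * (r + gamma * best_next - Q[(s, q, a)])
            trace.append(a)
            s, q = s2, q2
            if done:
                break
        traces.append(trace)
    return Q, traces
```

Because the mask is applied both when choosing actions and when forming the bootstrap target, the unsafe sequence never occurs during exploration, and the learned values reflect only behavior the supervisor would permit.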
Taxonomy
Topics: Petri Nets in System Modeling · Formal Methods in Verification · Distributed systems and fault tolerance
Methods: Q-Learning
