Learning Reward Machines from Partially Observed Policies
Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay

TL;DR
This paper introduces a SAT-based algorithm that learns a reward machine from partial policy information, identifying the true reward machine up to an equivalence class in both discrete and continuous environments.
Contribution
The paper combines prefix tree policies with SAT solving to identify reward machines from finite information, and extends the approach to continuous environments and real-world data.
Findings
Successfully recovers reward machines in discrete and continuous environments.
Effective with limited demonstration data from optimal policies.
Demonstrated on real data from mice experiments.
Abstract
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a…
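The two objects named in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class `RewardMachine`, the node names `u0` and `u1`, the propositions `a` and `b`, the MDP states `s0` and `s1`, and the helper `action_distribution` are all hypothetical. The sketch also assumes that rewards are attached to transitions and that unlisted transitions self-loop with zero reward.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite-state machine whose transitions are driven by atomic propositions."""
    initial: str
    # (node, atomic proposition) -> (next node, reward); all names are hypothetical.
    delta: dict = field(default_factory=dict)

    def step(self, node, prop):
        # Assumption of this sketch: unlisted pairs self-loop with zero reward.
        return self.delta.get((node, prop), (node, 0.0))

    def run(self, props):
        """Return the final node and the total reward for a proposition sequence."""
        node, total = self.initial, 0.0
        for prop in props:
            node, reward = self.step(node, prop)
            total += reward
        return node, total


# Hypothetical task: observe "a", then "b", to earn a reward of 1.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", "a"): ("u1", 0.0),
        ("u1", "b"): ("u0", 1.0),
    },
)

# A prefix tree policy stored as a plain dict:
# (MDP state, finite sequence of atomic propositions) -> action distribution.
prefix_tree_policy = {
    ("s0", ()): {"left": 0.5, "right": 0.5},
    ("s1", ("a",)): {"left": 0.1, "right": 0.9},
    ("s1", ("a", "b")): {"left": 0.5, "right": 0.5},
}


def action_distribution(policy, state, prefix):
    """Look up the action distribution for a state and a proposition prefix."""
    return policy[(state, tuple(prefix))]
```

In this toy example the policy in state `s1` differs depending on whether the prefix is `("a",)` or `("a", "b")`. Differences of this kind, at the same MDP state but after different proposition sequences, are the sort of information a prefix tree policy exposes about the underlying reward machine.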
Taxonomy
Topics: Data Stream Mining Techniques · Stock Market Forecasting Methods · Machine Learning in Healthcare
