Reinforcement Learning with Non-Markovian Rewards
Maor Gaon, Ronen I. Brafman

TL;DR
This paper introduces reinforcement learning algorithms that handle non-Markovian rewards by combining the classical RL methods Q-learning and R-max with automata learning, and proves that some of the resulting variants converge to an optimal policy in the limit.
Contribution
It presents what the authors describe as the first principled RL algorithms for non-Markovian rewards, integrating automata learning with Q-learning and R-max, with convergence to an optimal policy in the limit proven for some variants.
Findings
Empirical evaluation of four combinations of Q-learning and R-max with automata learning algorithms.
Proof of convergence to optimal policies for some variants.
Improved handling of non-Markovian reward structures in RL.
Abstract
The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For example, a reward for bringing coffee that is given only if coffee was requested earlier and has not yet been served is non-Markovian if the state only records current requests and deliveries. Past work considered the problem of modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR. Here, we address the problem of policy learning from experience with such rewards. We describe and evaluate empirically four combinations of the classical RL algorithms Q-learning and R-max with automata learning algorithms to obtain new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit.
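The coffee example can be made concrete with a small sketch. It is an illustration only, not the authors' implementation: the event names (`request`, `deliver`, `idle`), the two-state automaton, and the helpers `step_automaton` and `run` are all hypothetical. It shows that the same observed event (`deliver`) earns different rewards depending on history, and that the reward becomes a function of the current situation once the automaton state is tracked alongside the event.

```python
# Hypothetical two-state reward automaton for the coffee example.
# q = 0: no request is pending, q = 1: a request is pending.
def step_automaton(q, event):
    """Return (next automaton state, reward) for one observed event."""
    if event == "request":
        return 1, 0.0          # remember that coffee was requested
    if event == "deliver" and q == 1:
        return 0, 1.0          # reward only if a request was pending
    return q, 0.0              # anything else leaves the memory unchanged


def run(trace):
    """Replay a list of events; return (q_before, event, reward) per step."""
    q, out = 0, []
    for event in trace:
        q_next, reward = step_automaton(q, event)
        out.append((q, event, reward))
        q = q_next
    return out


# The final event is "deliver" in both traces, but the rewards differ,
# so the reward is not a function of the last observation alone.
with_request = run(["request", "idle", "deliver"])
without_request = run(["idle", "idle", "deliver"])
reward_with = with_request[-1][2]
reward_without = without_request[-1][2]

# A second delivery after the request was served earns nothing.
double_delivery = [r for _, _, r in run(["request", "deliver", "deliver"])]
```

In this sketch the pair (automaton state, event) fully determines the reward, which is the sense in which pairing the environment state with an automaton state restores the Markov property for the reward.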
Taxonomy
Methods: Q-Learning
