Efficient Reinforcement Learning in Probabilistic Reward Machines
Xiaofeng Lin, Xuezhou Zhang

TL;DR
This paper introduces the first efficient reinforcement learning algorithm for Probabilistic Reward Machines, achieving near-optimal regret bounds and enabling reward-free exploration in non-Markovian reward settings, with strong empirical performance.
Contribution
The paper presents a novel RL algorithm for PRMs with improved regret bounds, together with a new simulation lemma for non-Markovian rewards that underpins reward-free exploration.
Findings
Achieves a regret bound of \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\)
Matches the lower bound \(\Omega(\sqrt{HOAT})\) under certain conditions, up to a logarithmic factor
Demonstrates superior empirical performance over prior methods
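The "certain conditions" in the second finding can be read off the bound itself. The short derivation below is a sketch, not taken from the paper: it asks when the leading term \(\sqrt{HOAT}\) dominates each of the other two terms, ignoring constants and logarithmic factors.

```latex
% Regret bound (log factors hidden in \widetilde{O}):
%   \widetilde{O}\bigl(\sqrt{HOAT} + H^2 O^2 A^{3/2} + H\sqrt{T}\bigr)

% Second term is dominated by the first when
\sqrt{HOAT} \ge H^2 O^2 A^{3/2}
  \iff HOAT \ge H^4 O^4 A^3
  \iff T \ge H^3 O^3 A^2 .

% Third term is dominated by the first when
\sqrt{HOAT} \ge H\sqrt{T}
  \iff HOAT \ge H^2 T
  \iff OA \ge H .

% Under both conditions the sum is at most 3\sqrt{HOAT}, so
\sqrt{HOAT} + H^2 O^2 A^{3/2} + H\sqrt{T} \le 3\sqrt{HOAT}
  \quad\Longrightarrow\quad
  \text{regret} = \widetilde{O}\bigl(\sqrt{HOAT}\bigr).
```

So once \(T \ge H^3O^3A^2\) and \(OA \ge H\), the upper bound has the same order as the \(\Omega(\sqrt{HOAT})\) lower bound.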
Abstract
In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\), where \(H\) is the time horizon, \(O\) is the number of observations, \(A\) is the number of actions, and \(T\) is the number of time-steps. This result improves over the best-known bound of Bourel et al. (2023) for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When \(T \geq H^3O^3A^2\) and \(OA \geq H\), our regret bound leads to a regret of \(\widetilde{O}(\sqrt{HOAT})\), which matches the established lower bound of \(\Omega(\sqrt{HOAT})\) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first…
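To make "non-Markovian reward" concrete, here is a minimal Python sketch of a probabilistic reward machine. It is an illustration under assumptions, not the paper's formal definition or algorithm: the class name `ProbabilisticRewardMachine`, the helper `coffee_prm`, the string labels, and the toy "fetch coffee, then go to the office" task are all hypothetical. The idea it shows is that the machine keeps its own internal state, moves stochastically on high-level events, and pays reward based on that state, so the reward depends on history rather than only on the current observation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ProbabilisticRewardMachine:
    """Toy PRM: a finite-state machine with stochastic transitions (illustrative only)."""
    initial_state: str
    # transitions[(machine_state, label)] -> list of (next_state, probability, reward)
    transitions: dict
    state: str = field(init=False)

    def __post_init__(self):
        self.reset()

    def reset(self):
        """Return the machine to its initial state (start of an episode)."""
        self.state = self.initial_state

    def step(self, label, rng):
        """Advance on an event label and return the reward for that transition."""
        outcomes = self.transitions.get((self.state, label))
        if outcomes is None:
            return 0.0  # label is irrelevant in this state: stay put, no reward
        weights = [prob for _, prob, _ in outcomes]
        next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        self.state = next_state
        return reward


def coffee_prm(success_prob=0.9):
    """Hypothetical task: picking up coffee succeeds only with probability success_prob."""
    return ProbabilisticRewardMachine(
        initial_state="start",
        transitions={
            ("start", "coffee"): [
                ("has_coffee", success_prob, 0.0),
                ("start", 1.0 - success_prob, 0.0),
            ],
            ("has_coffee", "office"): [("done", 1.0, 1.0)],
        },
    )


if __name__ == "__main__":
    rng = random.Random(0)
    prm = coffee_prm()
    for label in ["office", "coffee", "coffee", "office"]:
        reward = prm.step(label, rng)
        print(f"label={label!r:10} machine_state={prm.state!r:14} reward={reward}")
```

In this sketch, reaching the office pays nothing unless the machine is already in `has_coffee`, so the same observation can yield different rewards depending on what happened earlier. Setting `success_prob=1.0` makes every transition deterministic, which corresponds to the DRM special case mentioned in the abstract.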
Taxonomy
Topics: Statistical and Computational Modeling
