Logarithmic regret bounds for continuous-time average-reward Markov decision processes
Xuefeng Gao, Xun Yu Zhou

TL;DR
This paper establishes logarithmic regret bounds for reinforcement learning in continuous-time Markov decision processes, introducing a novel algorithm with proven finite-time performance guarantees in the average-reward setting.
Contribution
The paper provides the first instance-dependent regret lower bounds for continuous-time MDPs, together with a learning algorithm whose finite-time regret bound attains the same logarithmic growth rate, extending RL theory beyond discrete-time models.
Findings
Regret lower bounds are logarithmic in the time horizon.
A new learning algorithm achieves logarithmic regret growth.
The analysis builds on upper confidence reinforcement learning, a careful estimation of the mean holding times, and stochastic comparison of point processes.
Abstract
We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
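To make the setting concrete, the following is a minimal simulation sketch of a finite continuous-time MDP with exponential holding times. It is not the paper's learning algorithm: the fixed policy, the two-state example, the reward-rate convention (reward earned per unit time while holding), and all names (`simulate_ctmdp`, `P`, `rates`, `reward_rates`) are assumptions made for illustration. It shows the one quantity the abstract singles out, the empirical mean holding time per state-action pair, which a learner would have to estimate from data.

```python
import random

def simulate_ctmdp(P, rates, reward_rates, policy, horizon, start=0, seed=0):
    """Simulate a finite continuous-time MDP under a fixed policy (illustrative).

    P[s][a]            : list of next-state probabilities after leaving s under a
    rates[s][a]        : rate of the exponential holding time in s under a
    reward_rates[s][a] : reward per unit time while holding (assumed convention)
    Returns the total reward up to `horizon` and the empirical mean
    holding time for each visited (state, action) pair.
    """
    rng = random.Random(seed)
    t, s, total = 0.0, start, 0.0
    hold_sum, hold_cnt = {}, {}
    while True:
        a = policy[s]
        tau = rng.expovariate(rates[s][a])  # holding time ~ Exp(rate)
        remaining = horizon - t
        if tau >= remaining:
            # The last sojourn is cut off by the horizon; it is not recorded.
            total += reward_rates[s][a] * remaining
            break
        total += reward_rates[s][a] * tau
        t += tau
        hold_sum[(s, a)] = hold_sum.get((s, a), 0.0) + tau
        hold_cnt[(s, a)] = hold_cnt.get((s, a), 0) + 1
        # Jump to the next state according to the transition probabilities.
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
    mean_hold = {k: hold_sum[k] / hold_cnt[k] for k in hold_sum}
    return total, mean_hold

# Hypothetical two-state, one-action example: the chain alternates 0 -> 1 -> 0.
P = [[[0.0, 1.0]], [[1.0, 0.0]]]
rates = [[2.0], [0.5]]            # true mean holding times: 0.5 and 2.0
reward_rates = [[1.0], [0.0]]     # reward accrues only while in state 0
policy = [0, 0]
horizon = 5000.0
total, mean_hold = simulate_ctmdp(P, rates, reward_rates, policy, horizon)
print(total / horizon, mean_hold)
```

In this example the long-run average reward is the fraction of time spent in state 0, which is 0.5 / (0.5 + 2.0) = 0.2, and the empirical mean holding times should approach 0.5 and 2.0 as the horizon grows.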
Taxonomy
Topics: Behavioral Health and Interventions · Mental Health Research Topics · Decision-Making and Behavioral Economics
