Square-root regret bounds for continuous-time episodic Markov decision processes
Xuefeng Gao, Xun Yu Zhou

TL;DR
This paper introduces a reinforcement learning algorithm for continuous-time episodic Markov decision processes, providing square-root regret bounds and demonstrating effectiveness through simulations.
Contribution
It develops a novel learning algorithm for continuous-time MDPs with theoretical regret bounds and empirical validation, extending RL theory beyond discrete-time models.
Findings
Regret bounds are of order square-root in the number of episodes.
Simulation experiments illustrate the performance of the proposed algorithm.
Both upper and lower bounds on regret are established.
Abstract
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state-action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
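To make the setting concrete, here is a minimal sketch of one episode of a finite-horizon continuous-time MDP, in which the holding time in each state is drawn from an exponential distribution whose rate depends on the current state-action pair. It is an illustration only, not the paper's algorithm: the function name `simulate_episode`, the reward-rate model (reward accrued per unit of time spent in a state), the toy two-state instance, and the fixed policy are all assumptions introduced here.

```python
import random


def simulate_episode(rates, trans_probs, reward_rates, policy, horizon, x0, rng):
    """Simulate one episode of a finite-horizon continuous-time MDP.

    rates[x][a]        -- rate of the exponential holding time for (state x, action a)
    trans_probs[x][a]  -- list of next-state probabilities for (x, a)
    reward_rates[x][a] -- assumed reward accrued per unit time under (x, a)
    policy(t, x)       -- action chosen at transition time t in state x
    Returns the total reward and the list of (transition time, state) pairs.
    """
    t, x, total_reward = 0.0, x0, 0.0
    path = [(t, x)]
    while True:
        a = policy(t, x)
        # Exponential holding time with mean 1 / rates[x][a].
        holding = rng.expovariate(rates[x][a])
        # Reward accrues only until the episode ends at the horizon.
        total_reward += reward_rates[x][a] * min(holding, horizon - t)
        if t + holding >= horizon:
            break
        t += holding
        # Draw the next state from the transition distribution of (x, a).
        next_states = range(len(trans_probs[x][a]))
        x = rng.choices(next_states, weights=trans_probs[x][a])[0]
        path.append((t, x))
    return total_reward, path


# Hypothetical two-state, two-action instance (all numbers are made up).
rates = [[1.0, 2.0], [0.5, 3.0]]
trans_probs = [
    [[0.0, 1.0], [0.0, 1.0]],  # from state 0, both actions jump to state 1
    [[1.0, 0.0], [1.0, 0.0]],  # from state 1, both actions jump to state 0
]
reward_rates = [[1.0, 0.5], [0.0, 2.0]]
horizon = 5.0


def fixed_policy(t, x):
    # Arbitrary stationary rule, used only for illustration.
    return 0 if x == 0 else 1


rng = random.Random(0)
reward, path = simulate_episode(
    rates, trans_probs, reward_rates, fixed_policy, horizon, 0, rng
)
print(f"total reward: {reward:.3f}, transitions: {len(path) - 1}")
```

A learning algorithm in this setting would not know `rates` or `trans_probs` and would have to estimate them from repeated episodes like this one; the regret measures the reward lost relative to an optimal policy while doing so.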
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research
