Square-root regret bounds for continuous-time episodic Markov decision processes

Xuefeng Gao; Xun Yu Zhou

arXiv:2210.00832·cs.LG·October 4, 2023·1 cites

Open Access

TL;DR

This paper introduces a reinforcement learning algorithm for continuous-time episodic Markov decision processes, providing square-root regret bounds and demonstrating effectiveness through simulations.

Contribution

It develops a novel learning algorithm for continuous-time MDPs with theoretical regret bounds and empirical validation, extending RL theory beyond discrete-time models.

Findings

01

Regret bounds are of order square-root in the number of episodes.

02

Simulation experiments illustrate the performance of the proposed algorithm.

03

Both upper and lower bounds on regret are established.

Abstract

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state-action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
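To make the setting in the abstract concrete, the following is a minimal sketch of simulating one episode of a finite-horizon continuous-time MDP, in which the holding time at each state-action pair is exponentially distributed. It is not the paper's learning algorithm. The function name `simulate_episode`, the toy two-state model, the assumption that rewards accrue at a constant rate between transitions, and the fixed policy are all illustrative assumptions rather than details taken from the paper.

```python
import random

def simulate_episode(rates, trans, reward_rates, policy, horizon, s0, rng):
    """Simulate one episode of a finite-horizon continuous-time MDP.

    rates[s][a]        -- rate of the exponential holding time (assumed > 0)
    trans[s][a]        -- list of next-state probabilities
    reward_rates[s][a] -- reward earned per unit time (an assumption of this sketch)
    policy(s, t)       -- action chosen on entering state s at time t
    Returns the total reward and the list of (jump time, state) pairs.
    """
    t, s, total = 0.0, s0, 0.0
    path = [(t, s)]
    while True:
        a = policy(s, t)
        # Holding time is exponential with a state-action dependent rate.
        hold = rng.expovariate(rates[s][a])
        if t + hold >= horizon:
            # The episode ends before the next transition occurs.
            total += reward_rates[s][a] * (horizon - t)
            break
        total += reward_rates[s][a] * hold
        t += hold
        # Draw the next state from the transition distribution.
        n_states = len(trans[s][a])
        s = rng.choices(range(n_states), weights=trans[s][a])[0]
        path.append((t, s))
    return total, path

# Toy model: 2 states, 2 actions (all numbers are made up for illustration).
rates = [[1.0, 2.0], [0.5, 3.0]]
trans = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
reward_rates = [[1.0, 0.5], [0.0, 2.0]]
horizon = 5.0

rng = random.Random(0)
total, path = simulate_episode(
    rates, trans, reward_rates, lambda s, t: 0, horizon, 0, rng
)
print(total, len(path))
```

Unlike a discrete-time MDP, the number of transitions per episode is random here: the episode ends when the clock reaches the horizon, not after a fixed number of steps.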

Taxonomy

Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research