On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning
Che Wang, Shuhan Yuan, Kai Shao, Keith Ross

TL;DR
This paper proves the almost sure convergence of the Monte Carlo Exploring Starts algorithm in a class of MDPs called Optimal Policy Feed-Forward MDPs, using a novel inductive proof approach based on the strong law of large numbers.
Contribution
It establishes convergence for the original MCES algorithm in a new class of environments, expanding theoretical understanding beyond previous results.
Findings
Proves almost sure convergence of MCES in Optimal Policy Feed-Forward MDPs.
Introduces a simple inductive proof method using the strong law of large numbers.
Extends convergence results to deterministic and episodic environments.
Abstract
A simple and natural algorithm for reinforcement learning (RL) is Monte Carlo Exploring Starts (MCES), where the Q-function is estimated by averaging the Monte Carlo returns, and the policy is improved by choosing actions that maximize the current estimate of the Q-function. Exploration is performed by "exploring starts", that is, each episode begins with a randomly chosen state and action, and then follows the current policy to the terminal state. In the classic book on RL by Sutton & Barto (2018), it is stated that establishing convergence for the MCES algorithm is one of the most important remaining open theoretical problems in RL. However, the convergence question for MCES turns out to be quite nuanced. Bertsekas & Tsitsiklis (1996) provide a counter-example showing that the MCES algorithm does not necessarily converge. Tsitsiklis (2002) further shows that if the original MCES…
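The MCES loop described in the abstract (exploring starts, averaging of Monte Carlo returns, greedy policy improvement) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the three-state deterministic chain, its reward values, and the names `step`, `mces`, `STOP_REWARD`, and `FINAL_REWARD` are assumptions made up for this example. The toy MDP is feed-forward in the sense that no state is revisited within an episode, and returns are undiscounted.

```python
import random

# Hypothetical toy MDP (not from the paper): non-terminal states 0, 1, 2.
# Action 0 ("go") moves to the next state; action 1 ("stop") ends the episode.
N_STATES = 3
STOP_REWARD = [1.0, 5.0, 2.0]   # reward for stopping in each state
FINAL_REWARD = 3.0              # reward for "go" out of the last state


def step(state, action):
    """Return (reward, next_state); next_state is None on termination."""
    if action == 1:
        return STOP_REWARD[state], None
    if state == N_STATES - 1:
        return FINAL_REWARD, None
    return 0.0, state + 1


def mces(num_episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q-function estimates
    counts = [[0, 0] for _ in range(N_STATES)]   # returns averaged per pair
    policy = [0] * N_STATES                      # current greedy policy

    for _ in range(num_episodes):
        # Exploring start: uniformly random initial state and action.
        state, action = rng.randrange(N_STATES), rng.randrange(2)
        trajectory = []
        while state is not None:
            reward, next_state = step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]           # then follow the current policy

        # Walk backwards, accumulating the undiscounted return.
        g = 0.0
        for s, a, r in reversed(trajectory):
            g += r
            counts[s][a] += 1
            q[s][a] += (g - q[s][a]) / counts[s][a]   # running average of returns
            policy[s] = max(range(2), key=lambda b: q[s][b])  # greedy improvement
    return q, policy


if __name__ == "__main__":
    q, policy = mces()
    print("policy:", policy)
    print("Q:", q)
```

In this toy chain the optimal policy is to go in states 0 and 2 and stop in state 1. Because each Q estimate averages all returns observed so far, including those generated under earlier policies, the estimates for some pairs approach their optimal values only gradually even after the greedy policy has settled.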
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Evolutionary Algorithms and Applications
