Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes
Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang

TL;DR
This paper introduces Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online reinforcement learning, and combines it with policy gradient methods to improve learning in non-Markov decision processes, validated by numerical experiments.
Contribution
It proposes MCTL for online RL, combines it with policy gradient algorithms, and provides convergence analysis and empirical validation.
Findings
MCTL effectively adapts MCTS for online RL environments.
The authors derive conditions under which the combined PG and MCTL policy converges asymptotically.
Numerical experiments demonstrate the method's effectiveness.
Abstract
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, the standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a combined policy approach of PG and MCTL to leverage their strengths. We derive conditions for asymptotic convergence with the results of a…
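The abstract's description of PG as gradient ascent on an expected return can be illustrated with a minimal sketch. The example below is not the paper's MCTL method or its combined policy; it is a plain REINFORCE-style update on a toy single-step bandit, and the names `softmax`, `reinforce_bandit`, `rewards`, `steps`, `lr`, and `seed` are hypothetical choices made for illustration only. Because the update uses only the sampled return and the log-probability of the chosen action, it does not rely on a Markov state, which is the property the abstract alludes to.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, lr=0.1, seed=0):
    """REINFORCE-style gradient ascent on a toy bandit (illustrative assumption).

    rewards[a] is the deterministic return obtained by choosing action a.
    The policy is a softmax over one preference parameter per action.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]
        # For a softmax policy, d/d theta[a] of log pi(action) is
        # 1[a == action] - pi(a); scale by the return and step uphill.
        for a in range(len(theta)):
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * ret * grad_log
    return softmax(theta)


final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print(final_probs)
```

In this toy setting the policy should shift probability toward the action with the higher return. Plain PG updates of this kind are the component that the abstract says can suffer from plateaus or peakiness, which is part of the stated motivation for combining PG with MCTL.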
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Sports Analytics and Performance
Methods: AlphaZero · Monte-Carlo Tree Search
