# Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

**Authors:** Dennis J. N. J. Soemers, Éric Piette, Matthew Stephenson, Cameron Browne

arXiv: 1905.05809 · 2019-05-16

## TL;DR

This paper trains self-play policies with a policy gradient estimated from MCTS value estimates rather than MCTS visit counts. The aim is a less exploratory policy that supports future extraction of interpretable strategies, not state-of-the-art game-playing performance.

## Contribution

It proposes a novel objective for training less exploratory policies and derives a policy gradient method using MCTS value estimates instead of visit counts.

## Key findings

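- Policies trained to mimic MCTS search behaviour via a cross-entropy loss are likely to inherit the exploration that MCTS includes by design.
- The proposed objective targets policies that are not exploratory, motivated by a future goal of extracting interpretable strategies.
- Various properties of the resulting policies are evaluated empirically in a variety of board games.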

## Abstract

In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-playing processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates, rather than MCTS visit counts. We empirically evaluate various properties of resulting policies, in a variety of board games.
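The contrast drawn in the abstract can be illustrated with a minimal single-state sketch. It is not the paper's derivation: it assumes a softmax policy over three actions, an expected-value objective of the form J = Σ_a π(a) Q(a) as a stand-in for the paper's objective, and hypothetical visit counts and value estimates. The names `cross_entropy_grad`, `expected_value_grad`, and `train` are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, visit_counts):
    """Gradient w.r.t. logits of the cross-entropy between the normalised
    visit counts and the softmax policy: pi(a) - target(a)."""
    pi = softmax(logits)
    n = sum(visit_counts)
    return [p - c / n for p, c in zip(pi, visit_counts)]


def expected_value_grad(logits, q_values):
    """Gradient w.r.t. logits of J = sum_a pi(a) * Q(a) (assumed objective):
    pi(a) * (Q(a) - J)."""
    pi = softmax(logits)
    j = sum(p * q for p, q in zip(pi, q_values))
    return [p * (q - j) for p, q in zip(pi, q_values)]


def train(grad_fn, signal, steps=2000, lr=0.5, ascend=False):
    """Plain gradient descent (or ascent) on the logits, starting from uniform."""
    logits = [0.0] * len(signal)
    sign = 1.0 if ascend else -1.0
    for _ in range(steps):
        grad = grad_fn(logits, signal)
        logits = [x + sign * lr * g for x, g in zip(logits, grad)]
    return softmax(logits)


# Hypothetical MCTS statistics for one state with three legal moves.
visit_counts = [70, 20, 10]   # exploration spreads visits over all moves
q_values = [0.6, 0.5, -0.2]   # value estimates for the same moves

imitation_policy = train(cross_entropy_grad, visit_counts)
value_policy = train(expected_value_grad, q_values, ascend=True)

print("visit-count imitation:", [round(p, 3) for p in imitation_policy])
print("value-based gradient: ", [round(p, 3) for p in value_policy])
```

Under these assumptions, the imitation policy converges to the normalised visit counts and so keeps the probability mass that MCTS spent on exploration, while the value-based policy concentrates on the move with the highest value estimate.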

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.05809/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1905.05809/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/1905.05809/full.md

---
Source: https://tomesphere.com/paper/1905.05809