On-line Policy Improvement using Monte-Carlo Search

Gerald Tesauro; Gregory R. Galperin

arXiv:2501.05407·cs.LG·April 7, 2025·212 cites

Open Access

TL;DR

This paper introduces a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In backgammon it reduces error rates by as much as a factor of 5 or more, and it may apply to other adaptive control tasks that can be simulated.

Contribution

The paper presents a parallelizable Monte-Carlo algorithm for policy improvement that substantially reduces the error rate of every initial policy tested, from a random policy to TD-Gammon.

Findings

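01. Substantial reduction in backgammon error rate, by as much as a factor of 5 or more.

02. Effective across a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network.

03. The algorithm is easily parallelizable (it has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers) and is applicable to other simulation-based adaptive control tasks.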

Abstract

We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base…
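The procedure the abstract describes (estimate each candidate action's long-term expected reward by simulation, follow the initial policy inside the simulation, then take the best-scoring action) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy "reach the goal" task, the `step` simulator, the random `base_policy`, and every name and parameter are hypothetical stand-ins for backgammon and TD-Gammon.

```python
import random

GOAL = 10                     # hypothetical toy task: reach position >= GOAL in time
ACTIONS = ("walk", "jump")    # hypothetical action set


def step(state, action, rng):
    """Toy simulator (an assumption for illustration; not backgammon)."""
    position, steps_left = state
    if action == "walk":
        position += 1                               # safe: always advances by 1
    else:
        position += 3 if rng.random() < 0.5 else 0  # risky: +3 half the time
    steps_left -= 1
    done = position >= GOAL or steps_left <= 0
    reward = 1.0 if position >= GOAL else 0.0       # reward only for reaching GOAL
    return (position, steps_left), reward, done


def base_policy(state, rng):
    """Initial policy to be improved; here it simply picks a random action."""
    return rng.choice(ACTIONS)


def rollout_value(state, action, policy, n_trials, rng):
    """Average reward over n_trials simulations that start with `action`
    and then follow `policy` until the episode ends."""
    total = 0.0
    for _ in range(n_trials):
        s, reward, done = step(state, action, rng)
        while not done:
            s, reward, done = step(s, policy(s, rng), rng)
        total += reward
    return total / n_trials


def improved_action(state, policy, n_trials, rng):
    """Return the action with the highest measured expected reward,
    together with the per-action estimates."""
    estimates = {a: rollout_value(state, a, policy, n_trials, rng) for a in ACTIONS}
    return max(estimates, key=estimates.get), estimates


if __name__ == "__main__":
    rng = random.Random(0)
    # One step left and one square short of the goal: walking always succeeds,
    # jumping succeeds only about half the time.
    action, estimates = improved_action((9, 1), base_policy, 200, rng)
    print(action, estimates)
```

Each trial in `rollout_value` is independent of the others, so in this sketch the trials could be split across workers and their totals summed, which is consistent with the abstract's remark that the algorithm is easily parallelizable.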

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems

Methods: Accumulating Eligibility Trace · Dense Connections · TD Lambda · Feedforward Network · TD-Gammon · Balanced Selection