Self-Play Learning Without a Reward Metric

Dan Schmidt; Nick Moran; Jonathan S. Rosenfeld; Jonathan Rosenthal; Jonathan Yedidia

arXiv:1912.07557·cs.LG·December 17, 2019

TL;DR

This paper introduces a modified AlphaZero algorithm that learns strategy games using only a total ordering of outcomes, eliminating the need for explicit reward balancing.

Contribution

The paper presents a novel approach to self-play learning that removes the requirement for a quantitative reward function, so users no longer have to hand-balance reward components such as the game winner and the margin of victory.

Findings

01 Learns optimal play in comparable time to AlphaZero

02 Does not require reward component balancing

03 Effective in a sample game

Abstract

The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
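To make the distinction in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It contrasts a scalar reward, which needs a hand-chosen weight to trade off winner against margin of victory, with a pure total ordering over outcomes. The names `Outcome`, `outcome_key`, `scalar_reward`, and `rank_value` are hypothetical, and the rank-based value is only one assumed way an ordering could be turned into a bounded training target; the listing above does not state how the paper does this.

```python
from typing import NamedTuple, Sequence, Tuple


class Outcome(NamedTuple):
    """Hypothetical game outcome: who won and by how much."""
    won: bool
    margin: int  # non-negative size of the win or loss


def scalar_reward(o: Outcome, margin_weight: float) -> float:
    # Quantitative reward: the user must pick margin_weight by hand,
    # and that choice changes how wins and margins trade off in expectation.
    sign = 1.0 if o.won else -1.0
    return sign * (1.0 + margin_weight * o.margin)


def outcome_key(o: Outcome) -> Tuple[bool, int]:
    # Total ordering only: any win beats any loss; among wins a larger
    # margin is better; among losses a smaller margin is better.
    return (o.won, o.margin if o.won else -o.margin)


def rank_value(o: Outcome, reference: Sequence[Outcome]) -> float:
    # Assumed illustration: place an outcome within a reference set of
    # outcomes using comparisons alone, then map its rank to [-1, 1].
    k = outcome_key(o)
    worse = sum(outcome_key(r) < k for r in reference)
    ties = sum(outcome_key(r) == k for r in reference)
    return 2.0 * (worse + 0.5 * ties) / len(reference) - 1.0


reference = [Outcome(False, 5), Outcome(False, 1), Outcome(True, 1), Outcome(True, 5)]
best = max(reference, key=outcome_key)
print(best, rank_value(best, reference))
```

Note that `rank_value` never multiplies a margin by a weight: it only asks which of two outcomes is preferred, which is the property the abstract describes as requiring "only a total ordering over game outcomes".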

Taxonomy

Topics: Artificial Intelligence in Games · Sports Analytics and Performance · Reinforcement Learning in Robotics

Methods: AlphaZero