Self-Play Learning Without a Reward Metric
Dan Schmidt, Nick Moran, Jonathan S. Rosenfeld, Jonathan Rosenthal, Jonathan Yedidia

TL;DR
This paper introduces a modified AlphaZero algorithm that learns strategy games using only a total ordering of outcomes, eliminating the need for explicit reward balancing.
Contribution
It presents a novel approach to self-play learning that removes the requirement for a quantitative reward function, simplifying the training process.
Findings
Learns optimal play in an amount of time comparable to AlphaZero
Requires no quantitative balancing of reward components, such as game winner against margin of victory
Demonstrated on a sample game
Abstract
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
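To make the distinction in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a hand-weighted scalar reward with a plain total ordering over outcomes. The `Outcome` class, its `won` and `margin` fields, the lexicographic ordering, and the `scalar_reward` helper with its `margin_weight` parameter are all assumptions introduced for illustration; the paper's actual outcome representation and training changes are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Outcome:
    """Hypothetical game outcome from one player's point of view.

    Field order matters: order=True compares instances as tuples of their
    fields, so `won` dominates and `margin` only breaks ties.
    """
    won: bool
    margin: int  # signed score difference; an assumption for illustration


def scalar_reward(outcome: Outcome, margin_weight: float) -> float:
    """Conventional quantitative reward; margin_weight must be hand-tuned."""
    base = 1.0 if outcome.won else -1.0
    return base + margin_weight * outcome.margin


outcomes = [
    Outcome(won=False, margin=-3),
    Outcome(won=True, margin=1),
    Outcome(won=False, margin=-1),
    Outcome(won=True, margin=4),
]

# Ranking needs only the ordering: no weight is chosen anywhere.
ranked = sorted(outcomes)
best = max(outcomes)
```

In this sketch `sorted` and `max` rely only on comparisons between outcomes, whereas `scalar_reward` cannot be evaluated until someone picks a value for `margin_weight`. That choice is the balancing step the abstract says the modified algorithm avoids.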
Taxonomy
Topics: Artificial Intelligence in Games · Sports Analytics and Performance · Reinforcement Learning in Robotics
Methods: AlphaZero
