Score vs. Winrate in Score-Based Games: which Reward for Reinforcement Learning?
Luca Pasqualini, Gianluca Amato, Marco Fantozzi, Rosa Gini, Alessandro Marchetti, Carlo Metta, Francesco Morandin, Maurizio Parton

TL;DR
This paper investigates the limitations of training reinforcement learning agents to optimize score differences instead of win/lose outcomes in perfect information games, offering empirical and theoretical insight into why score-based training is suboptimal.
Contribution
It provides empirical evidence and a theoretical framework explaining why score-based training may lead to suboptimal policies in deterministic, perfect information games.
Findings
Score-based training often results in suboptimal policies.
Outcome-optimal policies prefer higher score variance in losing states.
Deterministic games can behave like nondeterministic ones under approximation.
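The second finding can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes, purely for illustration, that the final score difference after a move is Gaussian, and the function name `win_probability` and the move parameters are hypothetical. Under that assumption, the move with the better expected score is not the move with the better chance of winning when the position is losing.

```python
import math


def win_probability(mean_score: float, std_score: float) -> float:
    """P(final score difference > 0), assuming a Gaussian score model (illustrative assumption)."""
    if std_score <= 0:
        # Degenerate case: the score is known exactly.
        return 1.0 if mean_score > 0 else 0.0
    # Standard normal CDF evaluated at mean/std, written with the error function.
    return 0.5 * (1.0 + math.erf(mean_score / (std_score * math.sqrt(2.0))))


# Two hypothetical moves in a losing position (negative expected score).
safe_move = {"mean_score": -5.0, "std_score": 2.0}    # better expected score, low variance
risky_move = {"mean_score": -6.0, "std_score": 10.0}  # worse expected score, high variance

p_safe = win_probability(**safe_move)
p_risky = win_probability(**risky_move)

# A score-maximizing agent compares expected scores; an outcome-maximizing agent compares win probabilities.
score_choice = "safe" if safe_move["mean_score"] >= risky_move["mean_score"] else "risky"
outcome_choice = "safe" if p_safe >= p_risky else "risky"

print(f"P(win | safe)  = {p_safe:.4f}")
print(f"P(win | risky) = {p_risky:.4f}")
print(f"score-based choice: {score_choice}, outcome-based choice: {outcome_choice}")
```

In this toy setting the two objectives disagree: the score-based choice is the low-variance move, while the outcome-based choice is the high-variance move, because only variance can carry a negative expected score across zero.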
Abstract
In recent years, the DeepMind algorithm AlphaZero has become the state of the art for efficiently tackling perfect information two-player zero-sum games with a win/lose outcome. However, when the win/lose outcome is decided by a final score difference, AlphaZero may play score-suboptimal moves, because all winning final positions are equivalent from the win/lose outcome perspective. This can be an issue, for instance, when the agent is used for teaching, or when trying to understand whether there is a better move. Moreover, there is the theoretical quest for the perfect game. A naive approach would be to train an AlphaZero-like agent to predict score differences instead of win/lose outcomes. Since the game of Go is deterministic, this should also produce outcome-optimal play. However, it is a folklore belief that "this does not work". In this paper, we first provide empirical evidence for this…
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
Methods: AlphaZero
