Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning

Niko A. Grupen; Michael Hanlon; Alexis Hao; Daniel D. Lee; Bart Selman

arXiv:2301.11857 · cs.AI · February 7, 2023

TL;DR

This paper analyzes the brittleness of AlphaZero, identifies two issues (policy-value misalignment and inconsistency within the value function), and proposes VISA-VIS, a method that improves policy-value alignment and value robustness.

Contribution

It introduces VISA-VIS, a novel method that improves policy-value alignment and value robustness in search-based multi-agent learning algorithms like AlphaZero.

Findings

1. Reduces policy-value misalignment by up to 76%
2. Decreases value generalization error by up to 50%
3. Lowers average value error by up to 55%

Abstract

Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value…
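
The idea of policy-value misalignment described in the abstract can be illustrated with a small sketch. This is one possible reading, not the paper's metric: it assumes a state counts as misaligned when the move the policy ranks highest is not a move the value estimates rank highest. The function names, the input layout, and the toy numbers are all hypothetical, and the per-move values are assumed to be expressed from the perspective of the player to move.

```python
from typing import Sequence, Tuple


def is_misaligned(policy_probs: Sequence[float],
                  action_values: Sequence[float]) -> bool:
    """Return True when the policy's top move is not a value-maximizing move.

    Assumption (not from the paper): action_values[i] is the value estimate
    of the state reached by move i, seen from the player to move.
    """
    if not policy_probs or len(policy_probs) != len(action_values):
        raise ValueError("inputs must be non-empty and of equal length")
    # Move preferred by the policy head (first index wins on ties).
    policy_move = max(range(len(policy_probs)), key=policy_probs.__getitem__)
    # The policy and value disagree if some other move has a strictly higher value.
    return action_values[policy_move] < max(action_values)


def misalignment_rate(
        states: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Fraction of states whose policy and value predictions disagree."""
    if not states:
        raise ValueError("need at least one state")
    flags = [is_misaligned(p, v) for p, v in states]
    return sum(flags) / len(flags)


# Hypothetical toy data: (policy probabilities, per-move value estimates).
toy_states = [
    ([0.7, 0.2, 0.1], [0.9, 0.1, -0.3]),   # aligned
    ([0.6, 0.3, 0.1], [-0.2, 0.8, 0.0]),   # misaligned
    ([0.1, 0.1, 0.8], [0.0, 0.0, 0.5]),    # aligned
    ([0.5, 0.5, 0.0], [0.3, 0.4, 0.1]),    # misaligned (tie broken toward move 0)
]

if __name__ == "__main__":
    print(f"misalignment rate: {misalignment_rate(toy_states):.2f}")
```

A rate computed this way depends on how the policy and value estimates are obtained (raw network outputs versus search-refined estimates), a choice this sketch leaves open.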

Taxonomy

Topics: Reinforcement Learning in Robotics · Optimization and Search Problems · Mobile Crowdsensing and Crowdsourcing

Methods: fail · AlphaZero