Policy-Based Self-Competition for Planning Problems
Jonathan Pirnay, Quirin Göttl, Jakob Burger, Dominik Gerhard Grimm

TL;DR
This paper introduces GAZ PTP, a novel planning algorithm that enhances single-player problem solving by incorporating self-competition with historical policies, leading to improved performance in combinatorial optimization tasks.
Contribution
The paper proposes GAZ PTP, a new self-competition method that integrates past policies into planning, outperforming existing GAZ variants in optimization problems.
Findings
GAZ PTP outperforms single-player GAZ variants while using only half the search budget.
The method is effective on combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and Job-Shop Scheduling.
The results demonstrate the benefit of using historical policies in planning algorithms.
Abstract
AlphaZero-type algorithms may stop improving on single-player tasks if the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performances and to reshape an episode's reward into a binary output, indicating whether the baseline has been exceeded or not. However, this baseline carries only limited information for the agent about how to improve its strategy. We leverage the idea of self-competition and directly incorporate a historical policy, instead of its scalar performance, into the planning process. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong…
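The scalar-baseline form of self-competition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not show GAZ PTP itself: the class `ScalarBaseline`, the function `binary_reward`, the moving-average baseline, the window size, and the ±1 reward encoding are all assumptions chosen for illustration.

```python
from collections import deque


class ScalarBaseline:
    """Hypothetical helper: running mean over recent episode outcomes."""

    def __init__(self, window=100):
        # Assumption: the baseline is a moving average over a fixed window.
        self.history = deque(maxlen=window)

    def value(self):
        # With no history yet, any outcome counts as beating the baseline.
        if not self.history:
            return float("-inf")
        return sum(self.history) / len(self.history)

    def update(self, outcome):
        self.history.append(outcome)


def binary_reward(outcome, baseline_value):
    """Reshape a scalar episode outcome into a binary win/loss signal.

    Assumption: higher outcomes are better, and the signal is encoded
    as +1 (baseline exceeded) or -1 (baseline not exceeded).
    """
    return 1 if outcome > baseline_value else -1


baseline = ScalarBaseline(window=3)
for past_outcome in (10.0, 20.0):
    baseline.update(past_outcome)

# The agent only learns whether it beat the baseline, not by how much
# or which decisions made the difference.
reward = binary_reward(16.0, baseline.value())
```

The binary signal discards everything except the comparison result, which is the limitation the abstract points to when it motivates competing against a historical policy rather than a scalar.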
Taxonomy
Topics: Auction Theory and Applications · Game Theory and Applications · Constraint Satisfaction and Optimization
Methods: AlphaZero
