Regret-Guided Search Control for Efficient Learning in AlphaZero

Yun-Jui Tsai; Wei-Yu Chen; Yan-Ru Ju; Yu-Hung Chang; Ti-Rong Wu

arXiv:2602.20809·cs.LG·February 25, 2026

Open Access·3 Reviews

TL;DR

This paper introduces Regret-Guided Search Control (RGSC), a method that improves AlphaZero's learning efficiency by focusing on high-regret states, leading to faster and more robust training across multiple board games.

Contribution

The paper proposes RGSC, a novel extension of AlphaZero that uses a regret network to identify and revisit high-regret states, enhancing learning efficiency and performance.

Findings

01 RGSC outperforms AlphaZero and Go-Exploit in multiple games.

02 RGSC improves win rate against KataGo from 69.3% to 78.2%.

03 Results show RGSC enhances training efficiency and robustness.

Abstract

Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a…
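The restart idea in the abstract can be illustrated with a minimal sketch. It assumes a buffer of `(state, regret)` pairs and a softmax sampling rule with a `temperature` parameter. The function name, the buffer layout, and the softmax rule are all hypothetical choices made for illustration, since the truncated abstract does not specify how states are stored or sampled.

```python
import math
import random


def sample_start_state(buffer, temperature=1.0, rng=random):
    """Pick a restart state, favoring entries with higher regret.

    buffer: list of (state, regret) pairs (hypothetical layout).
    Returns None when the buffer is empty, so the caller can fall back
    to the game's initial state.
    """
    if not buffer:
        return None
    # Subtract the maximum regret before exponentiating for numerical stability.
    max_regret = max(regret for _, regret in buffer)
    weights = [math.exp((regret - max_regret) / temperature) for _, regret in buffer]
    states = [state for state, _ in buffer]
    # random.choices draws one state with probability proportional to its weight.
    return rng.choices(states, weights=weights, k=1)[0]
```

With equal regrets this reduces to uniform sampling over stored states, which is roughly the "treats all states equally" behavior the abstract attributes to Go-Exploit.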

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

The paper demonstrates strong originality by creatively combining search control concepts (e.g., from Sutton & Barto, 2018, and Go-Exploit) with a novel regret-guided prioritization mechanism tailored to AlphaZero. This includes a ranking-based objective for the regret network, which addresses challenges like imbalance and non-stationarity in regret prediction—issues not fully tackled in prior work like Tavakoli et al. (2020) or Trudeau & Bowling (2023). While building on existing ideas, the app

Weaknesses

1. Adding the regret network increases model complexity (e.g., extra heads and training objectives), but the paper doesn't quantify the overhead in terms of GPU hours or inference time during self-play. This could be addressed by reporting relative costs compared to baselines, ensuring the efficiency gains aren't offset by higher per-iteration expenses.
2. While RGSC shows final Elo improvements, the Elo curves in Figure 4 reveal that advantages are not consistently stable across training, with

Reviewer 02·Rating 2·Confidence 2

Strengths

1. The paper defines “regret” as the average cumulative deviation from the current state to terminal states. This formulation captures the long-term impact of mistakes while avoiding the locality limitations of single-step error measures.
2. The method is evaluated on several board-game domains, with a reasonably comprehensive experimental suite within that problem class.
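One possible reading of the regret definition summarized by this reviewer is sketched below. It assumes the deviation at each step is the squared difference between the value estimate and the final game outcome, averaged over all steps from the current state to the end of the trajectory. The squared-error form and the function name `trajectory_regrets` are assumptions for illustration, not the paper's stated formula.

```python
def trajectory_regrets(values, outcome):
    """Compute a per-state regret for one finished game.

    values[t]: the agent's value estimate at step t.
    outcome: the final result of the game from the same perspective.
    Returns regrets[t], the mean of (values[i] - outcome)**2 over i >= t
    (assumed form of the deviation).
    """
    regrets = [0.0] * len(values)
    running = 0.0
    # Walk backwards so each suffix sum is built in a single pass.
    for t in range(len(values) - 1, -1, -1):
        running += (values[t] - outcome) ** 2
        regrets[t] = running / (len(values) - t)
    return regrets
```

Because each entry averages over the rest of the trajectory, an early state followed by many misjudged positions scores high even when its own single-step error is small, which is the long-term effect the reviewer contrasts with single-step measures.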

Weaknesses

well RGSC generalizes beyond these discrete, perfect-information domains.
2. Dependence on terminal observability: RGSC’s regret computation requires access to complete state-to-terminal outcomes. In many real-world or continuous-control tasks (e.g., robotics, Atari), the notion of a single terminal outcome can be ambiguous or trajectories cannot be fully recovered, making the regret signal difficult or impossible to compute reliably.
3. Baselines: the baselines appear outdated (2023 as the mos

Reviewer 03·Rating 6·Confidence 3

Strengths

- Novel and well-motivated idea: The use of regret-guided search control is an intuitive and principled extension of AlphaZero that better mimics human-style targeted learning.
- Clear technical contribution: Introducing a regret network and a ranking-based objective to identify and prioritize high-regret states is a concrete and original addition to the AlphaZero framework.
- Strong empirical results: RGSC shows consistent performance improvements across multiple domains (Go, Othello, Hex) and

Weaknesses

- Limited theoretical justification: The paper mainly relies on empirical validation; it lacks a clear theoretical analysis of why regret-based prioritization should improve sample efficiency.
- Scalability concerns: Results are shown only on small board sizes (9×9 Go, 10×10 Othello, 11×11 Hex). It’s unclear if RGSC scales to larger or more complex domains (e.g., 19×19 Go).
- Computational overhead: Maintaining a regret network and prioritized buffer might add non-trivial computational cost; eff

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Artificial Intelligence in Games