Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Tõnis Lees, Tambet Matiisen

TL;DR
This paper adapts the AlphaZero reinforcement learning algorithm to the asymmetric board game Tablut by modifying its architecture and adding stabilization techniques, enabling effective self-play learning despite the two sides' unequal piece counts and different objectives.
Contribution
The study introduces separate policy and value heads for each player role in AlphaZero, along with stabilization methods, to successfully apply it to asymmetric games like Tablut.
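The idea of per-role heads on a shared trunk can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `DualHeadNet`, the role constants, the layer sizes, and the single dense residual block are all assumptions chosen to keep the example small. It only shows how one set of shared features can feed a different policy and value head depending on which side is to move.

```python
import numpy as np

# Hypothetical role indices; the paper's own encoding is not given here.
ATTACKER, DEFENDER = 0, 1


class DualHeadNet:
    """Toy shared trunk with one policy head and one value head per role."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # Shared trunk weights: used for both roles.
        self.w_in = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.w_res = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        # Role-specific heads: leading axis indexes the role.
        self.w_policy = rng.normal(0.0, 0.1, (2, n_hidden, n_actions))
        self.w_value = rng.normal(0.0, 0.1, (2, n_hidden))

    def forward(self, x, role):
        """Return (policy, value) for a 1-D feature vector x and a role index."""
        h = np.maximum(x @ self.w_in, 0.0)        # input layer + ReLU
        h = np.maximum(h + h @ self.w_res, 0.0)   # one residual block
        logits = h @ self.w_policy[role]          # head chosen by role
        logits = logits - logits.max()            # numerically stable softmax
        exp = np.exp(logits)
        policy = exp / exp.sum()
        value = float(np.tanh(h @ self.w_value[role]))  # value in [-1, 1]
        return policy, value


net = DualHeadNet(n_features=8, n_hidden=16, n_actions=5)
x = np.ones(8)
attacker_policy, attacker_value = net.forward(x, ATTACKER)
defender_policy, defender_value = net.forward(x, DEFENDER)
```

In this sketch the same board features produce two different evaluations, one per role, while the trunk weights are shared by both.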
Findings
Achieved a BayesElo rating of 1235 after 100 self-play iterations.
Significant reduction in policy entropy, indicating more decisive play.
Overcame training instabilities using data augmentation and replay buffer enhancements.
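For readers unfamiliar with the policy entropy metric mentioned in the findings, the following small sketch shows how the Shannon entropy of a move distribution can be computed and why a lower value corresponds to more decisive play. The function name `policy_entropy` and the two example distributions are illustrative assumptions, not values from the paper.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a move-probability distribution."""
    # Zero-probability moves contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions over four legal moves.
uniform = [0.25, 0.25, 0.25, 0.25]   # undecided: every move equally likely
peaked = [0.97, 0.01, 0.01, 0.01]    # decisive: one move strongly preferred

uniform_entropy = policy_entropy(uniform)
peaked_entropy = policy_entropy(peaked)
```

The uniform distribution attains the maximum entropy for four moves, ln 4, while the peaked distribution has a much lower entropy. A drop in entropy over training therefore means the policy is concentrating probability on fewer moves.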
Abstract
This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data…
