Investigating Scale Independent UCT Exploration Factor Strategies
Robin Schmöcker, Christoph Schnell, Alexander Dockhorn

TL;DR
This paper proposes adaptive strategies for setting the UCT exploration constant in tree search algorithms, making them robust to different reward scales across various games, and demonstrates their effectiveness through experiments.
Contribution
The paper introduces five new lambda-strategies for UCT exploration, including a data-driven method using Q-value standard deviation, improving performance across diverse tasks.
Findings
The proposed strategy lambda = 2 * sigma, where sigma is the empirical standard deviation of all state-action pairs' Q-values, outperforms existing strategies.
Adaptive lambda strategies achieve better peak performance and robustness.
Experimental results span a wide range of game environments.
Abstract
The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with sparse rewards of $\{-1, 0, 1\}$ at the end of the game, this is not a problem, but many games feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant $\lambda$, called $\lambda$-strategies, that are agnostic to the game's reward scale. These $\lambda$-strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested $\lambda$-strategies, which is to choose $\lambda$ as $2 \cdot \sigma$, where $\sigma$ is the empirical standard deviation of all state-action pairs' Q-values…
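To make the recommended strategy concrete, here is a minimal Python sketch of choosing the exploration constant as twice the standard deviation of the Q-values and plugging it into a UCT score. Several details are assumptions not taken from the paper: the function names `exploration_constant`, `uct_score`, and `select_child` are hypothetical, the score uses the common textbook form Q + lambda * sqrt(ln N / n), the population standard deviation is used, and the fallback of 0 for fewer than two Q-values is arbitrary.

```python
import math
import statistics


def exploration_constant(q_values):
    """Return lambda = 2 * sigma over the given Q-values.

    Assumption: population standard deviation; the paper may define
    sigma differently. With fewer than two values we fall back to 0.0,
    which is an arbitrary choice made for this sketch.
    """
    if len(q_values) < 2:
        return 0.0
    return 2.0 * statistics.pstdev(q_values)


def uct_score(q, parent_visits, child_visits, lam):
    """Textbook UCT score: Q + lambda * sqrt(ln(N) / n)."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    return q + lam * math.sqrt(math.log(parent_visits) / child_visits)


def select_child(children, tree_q_values):
    """Pick the index of the child with the highest UCT score.

    children: list of (q_value, visit_count) pairs for one node.
    tree_q_values: Q-values of all state-action pairs in the search tree.
    """
    lam = exploration_constant(tree_q_values)
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(q, parent_visits, n, lam) for q, n in children]
    return scores.index(max(scores))


# Multiplying every reward by a constant scales Q and lambda together,
# so the selected child does not change.
children = [(0.2, 10), (0.5, 5), (0.9, 2)]
scaled = [(100.0 * q, n) for q, n in children]
same_choice = select_child(children, [q for q, _ in children]) == select_child(
    scaled, [q for q, _ in scaled]
)
```

Because the standard deviation scales linearly with the rewards, the exploration term keeps the same proportion to the Q-values whatever the reward scale is, which is the scale independence the abstract describes.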
Taxonomy
Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
