OptionZero: Planning with Learned Options

Po-Wei Huang; Pei-Chiun Peng; Hung Guei; Ti-Rong Wu

arXiv:2502.16634 · cs.AI · March 24, 2025

TL;DR

OptionZero enhances planning in reinforcement learning by autonomously discovering options through self-play, leading to significant performance improvements over MuZero in Atari games.

Contribution

It introduces an option network into MuZero for autonomous option discovery and modifies the dynamics network to enable deeper search with options.
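The option network is described here only at a high level. As a rough illustration of how a per-step policy could be turned into a multi-step option, the sketch below greedily chains the most likely primitive action while the joint probability of the chain stays above a threshold. The names `chain_option` and `toy_policy`, the 0.5 threshold, and the chaining rule itself are assumptions for illustration, not a description of OptionZero's actual option network.

```python
from typing import Callable, List, Sequence

# Hypothetical per-step policy: maps the actions already chained into the
# option to a probability distribution over primitive actions.
PolicyFn = Callable[[Sequence[int]], Sequence[float]]


def chain_option(policy: PolicyFn, max_len: int, threshold: float = 0.5) -> List[int]:
    """Greedily chain the most likely primitive actions into an option.

    Assumed rule (illustrative only): keep appending the argmax action while
    the product of the chosen actions' probabilities stays above `threshold`,
    up to `max_len` actions. The first action is always kept, so a length-1
    option is just an ordinary primitive action.
    """
    option: List[int] = []
    cumulative = 1.0
    while len(option) < max_len:
        probs = policy(option)
        best = max(range(len(probs)), key=lambda a: probs[a])
        # Stop extending once the joint probability would fall to the threshold.
        if option and cumulative * probs[best] <= threshold:
            break
        cumulative *= probs[best]
        option.append(best)
    return option


def toy_policy(history: Sequence[int]) -> Sequence[float]:
    # Hypothetical policy: confident about action 2 for two steps, then uncertain.
    table = [[0.05, 0.05, 0.9], [0.1, 0.0, 0.9], [0.4, 0.3, 0.3]]
    return table[min(len(history), 2)]


print(chain_option(toy_policy, max_len=3))  # prints [2, 2]
```

The chain stops after two actions because extending it a third time would drop the joint probability from 0.81 to 0.324, below the assumed threshold.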

Findings

01. Outperforms MuZero, with a 131.58% improvement in mean human-normalized score across 26 Atari games.

02. Learns strategic options tailored to different game characteristics.

03. Demonstrates effective autonomous discovery of options through self-play.
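The headline result is stated in terms of mean human-normalized score (HNS), the standard Atari metric in which 0% corresponds to random play and 100% to the human reference score. A minimal sketch of computing it follows; the game names and raw scores are hypothetical placeholders, and the page does not say whether the 131.58% figure is a relative gain or a percentage-point difference, so the sketch stops at the per-agent mean.

```python
from statistics import mean
from typing import Dict, Tuple


def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """HNS in percent: 0% matches random play, 100% matches the human score."""
    return 100.0 * (agent - random_score) / (human - random_score)


def mean_hns(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Mean HNS over games; each value is (agent, random, human) raw scores."""
    return mean(human_normalized_score(a, r, h) for a, r, h in scores.values())


# Hypothetical raw scores, for illustration only.
example = {
    "GameA": (550.0, 50.0, 300.0),  # 200% of the human-over-random margin
    "GameB": (125.0, 25.0, 125.0),  # exactly human level: 100%
}
print(mean_hns(example))  # prints 150.0
```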

Abstract

Planning with options (sequences of primitive actions) has been shown to be effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or with options learned from expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach, named OptionZero. OptionZero incorporates an option network into MuZero, providing autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network to provide environment transitions when using options, allowing deeper search under the same simulation constraints. Empirical experiments conducted in 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns…
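Two quantities make the claim of deeper search under the same simulation constraints concrete. When an option of k primitive actions is executed, its rewards can be collapsed into one discounted sum with a bootstrap discount of γ^k (the usual semi-MDP convention for options); and if each tree edge is an option expanded by a single dynamics-network call, a path of n edges covers the sum of the option lengths in primitive steps. The sketch below assumes both conventions; `option_return` and `primitive_depth` are hypothetical helpers, not OptionZero's code.

```python
from typing import List, Sequence, Tuple


def option_return(rewards: Sequence[float], gamma: float) -> Tuple[float, float]:
    """Collapse the primitive rewards collected while executing one option.

    Assumed semi-MDP convention: R = sum_i gamma**i * r_i, and the value
    after the option is discounted by gamma**k, where k = len(rewards).
    """
    total = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return total, gamma ** len(rewards)


def primitive_depth(path: List[Sequence[int]]) -> int:
    """Primitive actions covered by a search path whose edges are options.

    Assumes one dynamics-network call per edge, so len(path) model calls
    reach this many primitive steps.
    """
    return sum(len(option) for option in path)


path = [[2, 2, 2], [1], [0, 0]]  # three option edges (hypothetical actions)
print(len(path), primitive_depth(path))  # prints 3 6
print(option_return([1.0, 0.0, 1.0], gamma=0.5))  # prints (1.25, 0.125)
```

Here three dynamics calls reach six primitive steps, whereas a tree of primitive actions would need six calls for the same horizon.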

Peer Reviews

Decision: ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The need for efficiency in decision making in RL is clear, as single-step actions are slow and computationally expensive (even more so in slow simulators). Thus, the problem addressed by OptionZero is clear and well-motivated. Additionally, since much prior work on options appears to be in manually defined and demonstration-based settings, the generalisability of OptionZero is a strong selling point. Within fixed computational constraints, the idea of decreasing the frequency of

Weaknesses

It is unclear why options are outperformed by primitive actions in certain environments. The authors suggest that in environments with high combinatorial complexity, learning the dynamics model may be difficult, and thus options may simply produce more overhead than actual benefit. A more detailed analysis of these environments would be beneficial, e.g., investigating whether there is a correlation between the stochastic branching factor of the environment and the performance of options. Add

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper is well-written. The authors explain the use of options clearly with a toy example, demonstrating how options are used. The empirical results are also strong, achieving high mean normalized scores.

Weaknesses

It is not clear what actual benefits options bring. In the intro, the paper claims options allow for "searching deeper", but the empirical analysis shows "deeper search is likely but not necessary for improving performance". While it is nice to have the option to use options, could the authors provide a more detailed analysis of options beyond a deeper search? The paper could also benefit from discussions of 1) the trade-offs between increased complexity and performance gains 2) how much tuni

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Novel idea for autonomous and adaptable option discovery. The proposed method's ability to autonomously discover and tailor options to diverse game dynamics removes the need for predefined actions, making it highly adaptable across different environments.
- Convincing results for enhanced planning in RL. By integrating an option network, OptionZero reduces decision frequency, enabling computational efficiency, particularly in visually complex tasks like Atari games.
- Strong Performance Gai

Weaknesses

- Inconsistent Option Use Across Games: OptionZero's reliance on options appears to vary widely across Atari games. While longer options bring substantial gains in some games, they contribute less in others. This inconsistency suggests that the model’s option-based planning may struggle to generalize well across diverse, complex environments. The paper should discuss this limitation.
- Challenges in Complex Action Spaces: In games with intricate action spaces, such as Bank Heist (Atari), Optio

Code & Models

Repositories

rlglab/optionzero (Official)

Videos

OptionZero: Planning with Learned Options · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · AI-based Problem Solving and Planning

Methods: Batch Normalization · Average Pooling · Prioritized Experience Replay · Residual Connection · Residual Block · Monte-Carlo Tree Search · Convolution · MuZero