Constructing an Optimal Behavior Basis for the Option Keyboard
Lucas N. Alegre, Ana L. C. Bazzan, André Barreto, Bruno C. da Silva

TL;DR
This paper introduces a novel method to construct an optimal behavior basis for the Option Keyboard in multi-task reinforcement learning, enabling zero-shot optimal solutions for linear and certain non-linear tasks with fewer base policies.
Contribution
It provides an efficient way to build an optimal behavior basis that outperforms existing coverage sets and scales to complex domains.
Findings
Reduces the number of base policies needed for optimality.
Enables solving certain non-linear tasks optimally.
Outperforms state-of-the-art approaches in complex domains.
Abstract
Multi-task reinforcement learning aims to quickly identify solutions for new tasks with minimal or no additional interaction with the environment. Generalized Policy Improvement (GPI) addresses this by combining a set of base policies to produce a new one that is at least as good -- though not necessarily optimal -- as any individual base policy. Optimality can be ensured, particularly in the linear-reward case, via techniques that compute a Convex Coverage Set (CCS). However, these are computationally expensive and do not scale to complex domains. The Option Keyboard (OK) improves upon GPI by producing policies that are at least as good -- and often better. It achieves this through a learned meta-policy that dynamically combines base policies. However, its performance critically depends on the choice of base policies. This raises a key question: is there an optimal set of base policies…
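The GPI step mentioned in the abstract can be illustrated with a small sketch. It assumes the linear-reward setting, where each base policy's action values are the dot product of its successor features with a task weight vector `w`. The function name `gpi_action`, the array layout, and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def gpi_action(psi, w):
    """Pick an action by Generalized Policy Improvement at one state.

    psi: array of shape (n_policies, n_actions, d), the (assumed) successor
         features of each base policy for each action at the current state.
    w:   array of shape (d,), the reward weights of the task being solved.
    """
    q = psi @ w                # (n_policies, n_actions): Q-value of each base policy
    q_max = q.max(axis=0)      # best base-policy value for each action
    return int(q_max.argmax()), q_max

# Toy example: two base policies, two actions, two reward features.
psi = np.array([
    [[1.0, 0.0], [0.0, 0.2]],  # base policy 0
    [[0.0, 0.3], [0.0, 1.0]],  # base policy 1
])

action_a, q_a = gpi_action(psi, np.array([1.0, 0.0]))  # task rewarding feature 0
action_b, q_b = gpi_action(psi, np.array([0.0, 1.0]))  # task rewarding feature 1
```

Under the same assumptions, an Option Keyboard style agent would have a meta-policy output a weight vector `z` at each state and call `gpi_action(psi, z)` in place of the fixed task vector `w`, which is one way to read "dynamically combines base policies".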
Taxonomy
Topics: ICT Impact and Policies · Digital Platforms and Economics · Economic theories and models
Methods: Balanced Selection · Sparse Evolutionary Training
