Delightful Exploration

Ian Osband

arXiv:2605.13287·cs.LG·May 14, 2026

TL;DR

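Delight-gated exploration (DE) is a practical heuristic that overrides a host policy with an exploratory action only when that action's delight (expected improvement times surprisal) exceeds a gate price. In large action spaces it shows weaker regret growth than Thompson Sampling and ε-greedy across Bernoulli bandits, linear bandits, and tabular MDPs.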

Contribution

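The paper introduces Delight-gated exploration (DE), a host-override rule that spends exploratory actions only when their prospective delight exceeds a gate price, and shows that this heuristic recovers Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost.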

Findings

01. DE achieves weaker regret growth than Thompson Sampling and ε-greedy.

02. Hyperparameters transfer across different problem settings without retuning.

03. DE effectively manages exploration in large action spaces by prioritizing actions with high delight.

Abstract

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to ε-greedy, which bounds disruption but spends its override blindly. We introduce Delight-gated exploration (DE), a host-override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker…
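The gate described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the argument `probability` (the probability whose negative log is taken as the surprisal), and the way the candidate action and its expected improvement are obtained are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def delight(expected_improvement: float, probability: float) -> float:
    """Delight as stated in the abstract: expected improvement times surprisal.

    Assumption: surprisal is -log(probability). Which probability the
    paper uses is not stated in the abstract.
    """
    surprisal = -math.log(probability)
    return expected_improvement * surprisal


def gated_action(host_action, candidate_action,
                 expected_improvement: float, probability: float,
                 gate_price: float):
    """Hypothetical host-override rule.

    Keep the host's action unless the candidate's delight exceeds
    the gate price, in which case the candidate overrides it.
    """
    if delight(expected_improvement, probability) > gate_price:
        return candidate_action
    return host_action
```

Under these assumptions, a candidate that is unsurprising (probability near 1) has surprisal near 0 and so never clears a positive gate price, however large its expected improvement.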
