Finite-Time Regret Analysis of Retry-Aware Bandits
Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas

TL;DR
This paper analyzes ReMax, a retry-aware bandit algorithm, proves the first sublinear regret bound for it, and compares its exploration-exploitation behavior with that of Thompson sampling.
Contribution
The paper characterizes the optimal ReMax distribution for Gaussian rewards, proves a sublinear regret bound, and separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect.
Findings
ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation.
Posterior-variance scaling empirically mitigates severe underestimation.
ReMax can be more exploitative than Thompson sampling.
Abstract
We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@k and max@k. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This…
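As a rough illustration of the objective described in the abstract, the sketch below works through a toy setting that is an assumption of this example, not a statement of the paper's method: two arms with independent Gaussian posteriors, exactly two virtual draws, and the maximum taken over the latent arm values. Under those assumptions the objective is a concave quadratic in the probability p of sampling arm 1, and its maximizer balances the two expected improvements. All function names are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_max(m1, s1, m2, s2):
    """E[max(X1, X2)] for independent X_i ~ N(m_i, s_i^2)."""
    s = math.sqrt(s1 * s1 + s2 * s2)
    if s == 0.0:
        return max(m1, m2)
    z = (m1 - m2) / s
    return m1 * normal_cdf(z) + m2 * normal_cdf(-z) + s * normal_pdf(z)


def two_draw_objective(p, m1, s1, m2, s2):
    """Posterior expected maximum over two i.i.d. virtual draws (assumed setting).

    Arm 1 is drawn with probability p, arm 2 with probability 1 - p.
    """
    e = expected_max(m1, s1, m2, s2)
    return p * p * m1 + (1.0 - p) ** 2 * m2 + 2.0 * p * (1.0 - p) * e


def two_draw_maximizer(m1, s1, m2, s2):
    """Maximizer of two_draw_objective over p in [0, 1].

    Setting the derivative to zero gives p = gain1 / (gain1 + gain2), where
    gain1 = E[(X1 - X2)^+] and gain2 = E[(X2 - X1)^+] are the expected
    improvements of each arm over the other.
    """
    e = expected_max(m1, s1, m2, s2)
    gain1 = max(0.0, e - m2)  # expected improvement of arm 1 over arm 2
    gain2 = max(0.0, e - m1)  # expected improvement of arm 2 over arm 1
    total = gain1 + gain2
    if total == 0.0:
        return 0.5  # identical, deterministic arms: every p is optimal
    return gain1 / total


if __name__ == "__main__":
    p_star = two_draw_maximizer(1.0, 0.5, 0.5, 1.0)
    print("probability of sampling arm 1:", p_star)
```

In this toy setting, an arm whose posterior mean is lower still receives positive probability as long as its expected improvement over the other arm is positive, which is one way to read the exploration behavior the abstract attributes to the objective.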
