# Perturbed-History Exploration in Stochastic Multi-Armed Bandits

**Authors:** Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier

arXiv: 1902.10089 · 2019-11-06

## TL;DR

The paper introduces Perturbed-History Exploration (PHE), an online algorithm for stochastic multi-armed bandits that adds i.i.d. pseudo-rewards to its history in each round and pulls the arm with the highest average reward in the perturbed history, achieving near-optimal regret bounds.

## Contribution

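The paper presents an exploration algorithm whose pseudo-rewards are designed to offset, with high probability, potentially underestimated mean rewards of arms. It supports the algorithm with near-optimal regret bounds and an empirical comparison against state-of-the-art baselines.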

## Key findings

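- Derives near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE.
- Shows empirically that PHE is competitive with state-of-the-art baselines.
- Introduces a novel argument, the key step of the analysis, showing that randomized Bernoulli rewards lead to optimism.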

## Abstract

We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
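The perturb-then-act loop described in the abstract can be illustrated with a minimal sketch for Bernoulli arms. Several details here are assumptions for illustration and are not stated in this summary: the function name `phe`, the perturbation scale `a` (pseudo-rewards added per real observation), the choice of Bernoulli(1/2) pseudo-rewards, and pulling each arm once to initialize. It is a sketch under those assumptions, not the authors' reference implementation.

```python
import math
import random


def phe(means, n_rounds, a=2.0, seed=0):
    """Sketch of perturbed-history exploration on simulated Bernoulli arms.

    means    -- true mean reward of each arm (used only by the simulator)
    n_rounds -- horizon n
    a        -- assumed perturbation scale: pseudo-rewards per real pull
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k          # number of real pulls of each arm
    reward_sum = [0.0] * k   # cumulative real reward of each arm
    best = max(means)
    regret = 0.0             # expected cumulative regret

    for t in range(n_rounds):
        if t < k:
            # Assumed initialization: pull every arm once.
            arm = t
        else:
            scores = []
            for i in range(k):
                # Fresh i.i.d. pseudo-rewards each round; their count grows
                # with the arm's history, so the total is O(t).
                m = math.ceil(a * pulls[i])
                pseudo = sum(rng.random() < 0.5 for _ in range(m))
                # Average reward in the perturbed history of arm i.
                scores.append((reward_sum[i] + pseudo) / (pulls[i] + m))
            arm = max(range(k), key=scores.__getitem__)

        # Simulate a Bernoulli reward from the chosen arm.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sum[arm] += reward
        regret += best - means[arm]

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = phe([0.2, 0.8], n_rounds=500)
    print("pulls per arm:", pulls)
    print("expected regret:", round(regret, 2))
```

Because the pseudo-rewards are redrawn every round, an arm whose empirical mean is currently too low still has a chance of receiving a high perturbed average and being pulled again, which is the exploration mechanism the abstract describes.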

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.10089/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1902.10089/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1902.10089/full.md

---
Source: https://tomesphere.com/paper/1902.10089