Risk-Averse Multi-Armed Bandit Problems under Mean-Variance Measure

Sattar Vakili; Qing Zhao

arXiv:1604.05257·cs.LG·August 16, 2017

TL;DR

This paper extends multi-armed bandit analysis to account for risk through the mean-variance measure, establishing lower bounds on regret and showing that variations of the UCB and DSEE policies achieve them.

Contribution

It establishes lower bounds on regret defined in terms of the mean-variance of the reward process and adapts existing risk-neutral policies (UCB and DSEE) to achieve these bounds.

Findings

01

Lower bounds on regret: Ω(log T) for the model-specific regret and Ω(T^{2/3}) for the model-independent regret

02

Modified UCB and DSEE policies achieve these bounds

03

The results under the mean-variance risk measure parallel those of the classic risk-neutral bandit problem

Abstract

The multi-armed bandit problems have been studied mainly under the measure of expected total reward accrued over a horizon of length $T$. In this paper, we address the issue of risk in multi-armed bandit problems and develop parallel results under the measure of mean-variance, a commonly adopted risk measure in economics and mathematical finance. We show that the model-specific regret and the model-independent regret in terms of the mean-variance of the reward process are lower bounded by $\Omega(\log T)$ and $\Omega(T^{2/3})$, respectively. We then show that variations of the UCB policy and the DSEE policy developed for the classic risk-neutral MAB achieve these lower bounds.
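
To make the mean-variance measure concrete, here is a minimal sketch in Python. It assumes the common convention that an arm's mean-variance is its variance minus a risk-tolerance factor times its mean, with lower values preferred. That convention, the symbol `rho`, and the function names are illustrative assumptions. None of them is stated on this page, and this is not the paper's UCB or DSEE variant.

```python
from statistics import fmean, pvariance


def empirical_mean_variance(rewards, rho):
    """Empirical mean-variance of one arm's observed rewards.

    Assumed convention (not taken from this page): variance - rho * mean,
    where rho >= 0 is a hypothetical risk-tolerance parameter.
    Lower values are better under this convention.
    """
    return pvariance(rewards) - rho * fmean(rewards)


def pick_risk_averse_arm(samples_per_arm, rho):
    """Return the arm whose observed rewards have the smallest mean-variance."""
    scores = {
        arm: empirical_mean_variance(rewards, rho)
        for arm, rewards in samples_per_arm.items()
    }
    return min(scores, key=scores.get)


# Hypothetical reward samples: one low-variance arm, one high-variance arm
# with a larger average reward.
samples = {
    "steady": [1.0, 1.0, 1.0, 1.0],
    "volatile": [0.0, 3.0, 0.0, 3.0],
}

print(pick_risk_averse_arm(samples, rho=1.0))   # small rho: variance dominates
print(pick_risk_averse_arm(samples, rho=10.0))  # large rho: mean dominates
```

With a small `rho` the variance term dominates and the low-variance arm is preferred. As `rho` grows, the criterion approaches the classic risk-neutral choice of the arm with the highest mean.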
