Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs
Yifan Lin, Yuhao Wang, Enlu Zhou

TL;DR
This paper studies a risk-averse version of the contextual multi-armed bandit problem with linear payoffs, proposing algorithms with regret bounds and demonstrating their effectiveness in portfolio selection.
Contribution
The paper introduces a mean-variance risk criterion for contextual bandits with linear payoffs and provides a regret analysis for a variant of a Thompson Sampling-based algorithm.
Findings
Regret bound of $O\left(\left(1+\rho+\frac{1}{\rho}\right) d \ln T \ln\frac{K}{\delta} \sqrt{d K T^{1+2\epsilon} \ln\frac{K}{\delta} \frac{1}{\epsilon}}\right)$ with high probability.
Empirical results demonstrate the algorithm's effectiveness in portfolio selection.
The approach effectively balances risk and reward in sequential decision-making.
Abstract
In this paper we consider the contextual multi-armed bandit problem with linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left(\left(1+\rho+\frac{1}{\rho}\right) d \ln T \ln\frac{K}{\delta} \sqrt{d K T^{1+2\epsilon} \ln\frac{K}{\delta} \frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any …
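The setting in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the score `rho * mean - variance`, the plug-in residual variance estimate, the exploration scale `v`, and every name below (`mean_variance`, `MeanVarianceThompson`, `run_demo`) are assumptions made for illustration. Each arm keeps its own ridge-regression posterior (the disjoint model), a parameter vector is sampled per arm, and the arm with the largest sampled mean-variance score is pulled.

```python
import numpy as np


def mean_variance(mean, variance, rho):
    # Assumed convention: larger is better, rho acts as the risk tolerance.
    return rho * mean - variance


class MeanVarianceThompson:
    """Illustrative disjoint linear Thompson Sampling with a mean-variance score."""

    def __init__(self, n_arms, dim, rho, v=1.0, seed=0):
        self.rho, self.v = rho, v  # v: hypothetical exploration scale
        self.rng = np.random.default_rng(seed)
        self.B = np.stack([np.eye(dim)] * n_arms)  # per-arm precision matrices
        self.f = np.zeros((n_arms, dim))           # per-arm sum of reward * context
        self.pulls = np.zeros(n_arms, dtype=int)
        self.sq_resid = np.zeros(n_arms)           # per-arm sum of squared residuals

    def select(self, contexts):
        scores = np.empty(len(contexts))
        for i, x in enumerate(contexts):
            theta_hat = np.linalg.solve(self.B[i], self.f[i])
            # Sample theta ~ N(theta_hat, v^2 * B^{-1}) via the Cholesky factor of B.
            chol = np.linalg.cholesky(self.B[i])
            z = self.rng.standard_normal(x.shape[0])
            theta = theta_hat + self.v * np.linalg.solve(chol.T, z)
            # Plug-in variance estimate (a simplification, not posterior sampling).
            var_hat = self.sq_resid[i] / self.pulls[i] if self.pulls[i] else 0.0
            scores[i] = mean_variance(x @ theta, var_hat, self.rho)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Residual is measured against the estimate held before this observation.
        theta_hat = np.linalg.solve(self.B[arm], self.f[arm])
        self.sq_resid[arm] += (reward - x @ theta_hat) ** 2
        self.pulls[arm] += 1
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x


def run_demo(T=500, seed=0):
    """Two synthetic arms: arm 0 has a higher mean but much noisier rewards."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([[1.0, 1.0], [0.8, 0.8]])
    noise_sd = np.array([2.0, 0.1])
    agent = MeanVarianceThompson(n_arms=2, dim=2, rho=1.0, seed=seed + 1)
    for _ in range(T):
        contexts = rng.uniform(0.0, 1.0, size=(2, 2))
        arm = agent.select(contexts)
        reward = contexts[arm] @ theta_true[arm] + noise_sd[arm] * rng.standard_normal()
        agent.update(arm, contexts[arm], reward)
    return agent.pulls
```

Calling `run_demo()` returns the number of pulls per arm on this synthetic problem. With `rho=1.0`, the low-noise arm has the larger mean-variance score under the assumed convention, so a risk-averse learner should favor it even though its mean reward is lower.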
Taxonomy
Topics: Advanced Bandit Algorithms Research · Risk and Portfolio Optimization · Optimization and Search Problems
