Order Optimal Regret Bounds for Sharpe Ratio Optimization under Thompson Sampling
Mohammad Taha Shah, Sabrina Khurshid, Gourab Ghatak

TL;DR
This paper introduces a Bayesian Thompson Sampling algorithm for optimizing the Sharpe ratio in multi-armed bandits, achieving order-optimal regret bounds and demonstrating superior performance over existing methods.
Contribution
The paper develops SRTS, a risk-aware Thompson Sampling algorithm with a unified approach for different risk regimes, and provides theoretical guarantees of its optimality.
Findings
Achieves an $\tilde{O}(\log n)$ regret bound for SR optimization.
Provides a matching lower bound, establishing order-optimality.
Shows improved empirical performance over existing risk-aware bandit algorithms.
Abstract
In this paper, we study sequential decision-making for maximizing the Sharpe ratio (SR) in a stochastic multi-armed bandit (MAB) setting. Unlike standard bandit formulations that maximize cumulative reward, SR optimization requires balancing expected return against reward variability. As a result, the learning objective depends jointly on the mean and variance of the reward distribution and takes a fractional form. To address this problem, we propose Sharpe Ratio Thompson Sampling (\texttt{SRTS}), a Bayesian algorithm for risk-adjusted exploration. For Gaussian reward models, the algorithm employs a Normal-Gamma conjugate posterior to capture uncertainty in both the mean and the precision of each arm. In contrast to additive mean-variance (MV) formulations, which often require different algorithms across risk regimes, the fractional SR objective yields a single sampling rule that applies…
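To make the abstract's description concrete, the following is a minimal sketch of Thompson Sampling with a Normal-Gamma posterior on Gaussian arms. It is not the authors' \texttt{SRTS} implementation: the class name `NormalGammaArm`, the function `run_sr_thompson`, the prior hyperparameters, and the plain definition SR = mean / standard deviation (with no regularization term) are all assumptions made for illustration.

```python
import numpy as np


class NormalGammaArm:
    """Normal-Gamma posterior over (mean, precision) of one Gaussian arm.

    Prior hyperparameters are illustrative assumptions, not values from the paper.
    """

    def __init__(self, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
        self.mu, self.kappa, self.alpha, self.beta = mu0, kappa0, alpha0, beta0

    def update(self, x):
        # Standard conjugate update for a single observation x.
        mu_old, kappa_old = self.mu, self.kappa
        self.kappa = kappa_old + 1.0
        self.mu = (kappa_old * mu_old + x) / self.kappa
        self.alpha += 0.5
        self.beta += 0.5 * kappa_old * (x - mu_old) ** 2 / self.kappa

    def sample(self, rng):
        # precision ~ Gamma(alpha, rate=beta); NumPy takes scale = 1 / rate.
        tau = rng.gamma(self.alpha, 1.0 / self.beta)
        # mean | precision ~ Normal(mu, 1 / (kappa * precision)).
        mean = rng.normal(self.mu, 1.0 / np.sqrt(self.kappa * tau))
        return mean, tau


def run_sr_thompson(true_means, true_stds, horizon, seed=0):
    """Pull arms for `horizon` rounds, choosing the arm with the largest sampled SR."""
    rng = np.random.default_rng(seed)
    arms = [NormalGammaArm() for _ in true_means]
    pulls = np.zeros(len(arms), dtype=int)
    for _ in range(horizon):
        scores = []
        for arm in arms:
            mean, tau = arm.sample(rng)
            # Sampled Sharpe ratio: mean / std = mean * sqrt(precision).
            scores.append(mean * np.sqrt(tau))
        k = int(np.argmax(scores))
        reward = rng.normal(true_means[k], true_stds[k])
        arms[k].update(reward)
        pulls[k] += 1
    return pulls


if __name__ == "__main__":
    # Two arms with equal means; the first has lower variance, hence higher SR.
    print(run_sr_thompson([1.0, 1.0], [0.5, 2.0], horizon=2000))
```

In this sketch the same sampling rule is used regardless of how mean and variance trade off, which mirrors the abstract's point that the fractional objective needs only one rule; any regime-specific details of the actual algorithm are not reproduced here.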
