Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
Matías Carrasco, Alejandro Cholaquidis

TL;DR
This paper studies stochastic multi-armed bandits whose objective is a concave statistical utility of the long-run reward distribution rather than expected reward alone, using influence functions to estimate gradients from bandit feedback and entropic mirror ascent to optimize the arm weights.
Contribution
It develops a novel influence-function based gradient estimation method for distributional utilities and applies mirror ascent algorithms to optimize these utilities in bandit settings.
Findings
Established regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function.
Demonstrated the framework on variance and Wasserstein utilities with numerical experiments.
Compared exact and plug-in influence-function implementations showing practical effectiveness.
Abstract
We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The…
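To make the pipeline in the abstract concrete, here is a minimal sketch for the variance utility, not the authors' implementation. It relies on the standard influence function of the variance, \(\mathrm{IF}(x;P)=(x-\mu_P)^2-\sigma_P^2\). Everything else is an assumption made for illustration: the function names, the importance-weighted one-arm gradient estimate, the plug-in moments built from per-arm running averages, the step size `eta`, and the use of uniform mixing with parameter `delta` as a simple stand-in for the truncated simplex.

```python
import numpy as np


def variance_influence(x, mu, var):
    """Influence function of the variance functional at a law with mean mu and variance var."""
    return (x - mu) ** 2 - var


def mirror_ascent_variance(sample_arm, K, T, eta=0.05, delta=0.05, seed=0):
    """Entropic mirror ascent (multiplicative weights) on arm weights w.

    Hypothetical sketch: sample_arm(arm, rng) returns one reward from the chosen arm.
    """
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)   # start from the uniform mixed policy
    counts = np.zeros(K)      # number of pulls per arm
    s1 = np.zeros(K)          # per-arm running sum of rewards
    s2 = np.zeros(K)          # per-arm running sum of squared rewards

    for _ in range(T):
        arm = rng.choice(K, p=w)
        x = sample_arm(arm, rng)
        counts[arm] += 1
        s1[arm] += x
        s2[arm] += x * x

        # Plug-in moments of the mixture law P^w (assumption: unplayed arms count as 0).
        n = np.maximum(counts, 1)
        mu = w @ (s1 / n)
        var = max(w @ (s2 / n) - mu ** 2, 0.0)

        # Importance-weighted gradient estimate: only the pulled arm gets a nonzero entry.
        g = np.zeros(K)
        g[arm] = variance_influence(x, mu, var) / w[arm]

        # Multiplicative-weights step, done in the log domain for numerical stability.
        logw = np.log(w) + eta * g
        logw -= logw.max()
        w = np.exp(logw)
        w /= w.sum()

        # Assumption: uniform mixing keeps every weight at least delta / K.
        w = (1.0 - delta) * w + delta / K

    return w


if __name__ == "__main__":
    means = np.array([0.0, 0.0, 1.0])
    sds = np.array([1.0, 0.1, 0.5])
    final_w = mirror_ascent_variance(
        lambda arm, rng: rng.normal(means[arm], sds[arm]), K=3, T=2000
    )
    print(final_w)
```

The lower bound `delta / K` on every weight keeps the importance weights `1 / w[arm]` bounded, which is the practical reason for working on a truncated simplex in this sketch.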
