Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Haishan Ye

TL;DR
This paper establishes the first high-probability regret bounds for online convex optimization with two-point bandit feedback on strongly convex functions, achieving minimax optimality in dimension and time horizon.
Contribution
It provides the first high-probability regret bounds for strongly convex functions in two-point bandit feedback, resolving a longstanding open problem.
Findings
Achieved a regret bound of O(d(log T + log(1/δ))/μ) for strongly convex losses.
Bound is minimax optimal with respect to T and d.
Addresses the challenge of heavy-tailed gradient estimators in bandit settings.
Abstract
We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by \citet{agarwal2010optimal}. The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge and provide the first high-probability regret bound of O(d(log T + log(1/δ))/μ) for μ-strongly convex losses. Our result is minimax optimal with respect to both the time horizon T and the dimension d.
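As an illustration of how two function values can yield a gradient estimate, here is a minimal sketch of the standard symmetric two-point estimator, g = (d / (2·eps)) · (f(x + eps·u) − f(x − eps·u)) · u, with u drawn uniformly from the unit sphere. This is the textbook construction, not necessarily the exact estimator or algorithm analyzed in the paper. The names `sample_unit_sphere`, `two_point_gradient`, and the perturbation radius `eps` are hypothetical and chosen for this example; `eps` is unrelated to the confidence parameter δ in the regret bound.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d.

    Normalizing a standard Gaussian vector gives a uniform direction.
    """
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:  # guard against the (measure-zero) all-zero draw
            return [c / norm for c in v]


def two_point_gradient(f, x, eps, u):
    """Estimate the gradient of f at x from exactly two function values.

    Assumed form (standard two-point estimator, not taken from the paper):
        g = d / (2 * eps) * (f(x + eps * u) - f(x - eps * u)) * u
    """
    d = len(x)
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


# Example: a linear loss f(x) = <a, x>, whose true gradient is a.
a = [1.0, -2.0, 0.5]


def linear_loss(x):
    return sum(ai * xi for ai, xi in zip(a, x))


rng = random.Random(0)
n_samples = 5000
mean_estimate = [0.0] * len(a)
for _ in range(n_samples):
    u = sample_unit_sphere(len(a), rng)
    g = two_point_gradient(linear_loss, [0.0, 0.0, 0.0], 1e-2, u)
    mean_estimate = [m + gi / n_samples for m, gi in zip(mean_estimate, g)]

# A single estimate is noisy; the average over many directions approaches a.
print(mean_estimate)
```

For a linear loss each estimate equals d·⟨a, u⟩·u exactly, so it is unbiased for the gradient a, but any single draw can be far from a. The abstract identifies the heavy-tailed nature of such bandit gradient estimators as the obstacle to standard concentration arguments.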
