Tight Rates for Bandit Control Beyond Quadratics
Y. Jennifer Sun, Zhou Lu

TL;DR
This paper introduces an algorithm that achieves near-optimal regret for bandit control problems with adversarial perturbations and adversarially chosen non-quadratic costs, improving on previous bounds.
Contribution
It presents a novel algorithm that attains $\tilde{O}(\sqrt{T})$ regret for bandit control with non-quadratic costs, overcoming the challenges posed by memory and improving on prior results.
Findings
Achieves $\tilde{O}(\sqrt{T})$ regret in bandit control.
Develops an improved algorithm for bandit convex optimization (BCO) with memory.
Demonstrates the effectiveness of a reduction to memoryless BCO.
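A reduction from BCO with memory to memoryless BCO typically rests on a standard observation: when the iterates move slowly, a loss with memory $f_t(x_{t-m},\dots,x_t)$ stays close to the induced unary loss $\tilde f_t(x) = f_t(x,\dots,x)$. The sketch below illustrates only that observation for a hypothetical scalar linear loss with memory $m = 2$; the names `unary_loss`, `memory_gap`, `linear_gap_bound`, and the weights are assumptions chosen for illustration, not the paper's algorithm.

```python
from typing import Callable, List, Sequence

# A loss with memory m maps a window [x_{t-m}, ..., x_t] (oldest first) to a float.
MemoryLoss = Callable[[Sequence[float]], float]


def unary_loss(f: MemoryLoss, x: float, m: int) -> float:
    """Induced memoryless loss f~(x) = f(x, ..., x) with all m + 1 slots set to x."""
    return f([x] * (m + 1))


def memory_gap(f: MemoryLoss, window: Sequence[float]) -> float:
    """|f(x_{t-m}, ..., x_t) - f~(x_t)|, where x_t is the last entry of the window."""
    m = len(window) - 1
    return abs(f(window) - unary_loss(f, window[-1], m))


# Hypothetical scalar linear loss with memory m = 2 (weights chosen for illustration).
WEIGHTS: List[float] = [0.25, 0.5, 1.0]


def linear_memory_loss(window: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(WEIGHTS, window))


def linear_gap_bound(window: Sequence[float]) -> float:
    """Lipschitz-style bound sum_i |w_i| * |x_{t-i} - x_t| on the gap for the linear loss."""
    x_t = window[-1]
    return sum(abs(w) * abs(x - x_t) for w, x in zip(WEIGHTS, window))


if __name__ == "__main__":
    slow = [1.00, 1.01, 1.02]  # iterates that barely move
    fast = [0.0, 0.5, 1.0]     # iterates that move a lot
    print(memory_gap(linear_memory_loss, slow), linear_gap_bound(slow))  # both approximately 0.01
    print(memory_gap(linear_memory_loss, fast), linear_gap_bound(fast))  # 0.5 0.5
```

The gap is controlled by how far the recent iterates drift from $x_t$, which is why memoryless BCO algorithms with small step sizes can be plugged into the memory setting at a bounded extra cost.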
Abstract
Unlike the problems studied in classical control theory, such as Linear Quadratic Control (LQC), real-world control problems are highly complex. They often involve adversarial perturbations, bandit feedback, and non-quadratic, adversarially chosen cost functions. A fundamental yet unresolved question is whether optimal regret can be achieved for these general control problems. The standard approach reduces the problem to bandit convex optimization with memory. In the bandit setting, constructing a low-variance gradient estimator is challenging because of the memory structure and the non-quadratic loss functions. In this paper, we answer this question affirmatively. Our main contribution is an algorithm that achieves optimal regret for bandit non-stochastic control with strongly convex and smooth cost functions in the presence of…
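To see why gradient estimation under bandit feedback is delicate, the sketch below implements the classical spherical one-point estimator from the BCO literature for a memoryless loss. It is a standard baseline offered as a hedged illustration, not the estimator developed in this paper; the function names and the example loss `linear_loss` are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def sample_unit_sphere(d: int, rng: random.Random) -> Vector:
    """Draw a direction uniformly from the unit sphere in R^d (normalized Gaussian)."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def one_point_gradient(f: Callable[[Vector], float], x: Vector, delta: float,
                       rng: random.Random) -> Vector:
    """Spherical one-point estimate (d / delta) * f(x + delta * u) * u.

    Only one loss value is queried, matching bandit feedback. In expectation over u
    this equals the gradient of the delta-smoothed loss, not of f itself.
    """
    d = len(x)
    u = sample_unit_sphere(d, rng)
    value = f([xi + delta * ui for xi, ui in zip(x, u)])
    return [(d / delta) * value * ui for ui in u]


def average_estimate(f: Callable[[Vector], float], x: Vector, delta: float,
                     n: int, seed: int = 0) -> Vector:
    """Average n independent one-point estimates to expose their mean."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n):
        g = one_point_gradient(f, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n for t in total]


def linear_loss(y: Vector) -> float:
    """Example loss whose gradient is [1, 2] everywhere."""
    return 1.0 * y[0] + 2.0 * y[1]


if __name__ == "__main__":
    # The average should be close to the true gradient [1.0, 2.0].
    print(average_estimate(linear_loss, [0.0, 0.0], delta=0.1, n=100_000))
```

For the linear loss the mean is exact, but the per-sample spread is large: each estimate scales with $(d/\delta)\,f(x+\delta u)$, so shifting the loss by a constant leaves the mean unchanged while inflating the variance. With memory, the queried value also depends on the previous $m$ perturbed points, which is one way to see the difficulty described above.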
Peer Reviews
Decision: NeurIPS 2024 poster
1. Although I have only skimmed the proofs of several lemmas, the analysis appears rigorous and mathematically correct. 2. The delay mechanism used to decorrelate the most recent m iterates is interesting and may be useful in other delayed-feedback settings.
1. The specific contribution of this work relative to the previous work of Suggala et al. [2024] is still somewhat unclear. According to Lines 382 to 387, it seems that the most important algorithmic contribution is the delay mechanism. 2. It is not clear what is meant by "preserves an estimation of Hessian $H_t$ for free" in Line 236. It seems related to Assumption 5, which provides $H_t$ to the learner directly at the end of each iteration. I wonder whether this sort of assumption is general, and
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Decision-Making and Behavioral Economics
