Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
Kefan Dong, Jiaqi Yang, Tengyu Ma

TL;DR
This paper introduces ViOlin, a model-based algorithm for nonlinear bandits and RL that provably converges to a local maximum with sample complexity depending only on the sequential Rademacher complexity of the model class, and it shows that optimism-based exploration can over-explore.
Contribution
The paper proposes ViOlin (Virtual Ascent with Online Model Learner), a model-based algorithm for nonlinear bandits and RL with provable convergence to local maxima, and shows that optimism may lead to over-exploration.
Findings
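ViOlin converges to a local maximum with sample complexity that depends only on the sequential Rademacher complexity of the model class.
Optimism may lead to over-exploration even for a two-layer neural net model class.
The results imply novel global or local regret bounds for linear bandits with finite or sparse model classes and for two-layer neural net bandits.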
Abstract
This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima because we show that global convergence is statistically intractable even for one-layer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds on several concrete settings such as linear bandit with finite or sparse model class, and two-layer neural net bandit. A key algorithmic insight is that optimism may lead to over-exploration even for two-layer neural net model class. On the other hand, for…
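The idea of ascending on a learned model can be illustrated with a toy sketch. This is not the paper's ViOlin algorithm: the quadratic model class, the squared losses on reward and gradient mismatch, the finite-difference gradient estimate, and all step sizes are assumptions chosen only to make the example small and deterministic. All function names are hypothetical.

```python
import numpy as np

def model_reward(theta, a):
    # Hypothetical model class: eta(theta, a) = -||a - theta||^2
    return -np.sum((a - theta) ** 2)

def model_grad_a(theta, a):
    # Gradient of the model reward with respect to the action a
    return -2.0 * (a - theta)

def virtual_ascent(true_reward, dim, rounds=200, lr_model=0.05,
                   lr_action=0.1, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)   # initial model parameters
    a = np.zeros(dim)              # initial action
    for _ in range(rounds):
        # Query the (deterministic) reward at the current action
        r = true_reward(a)
        # Assumption: estimate the true gradient by central differences
        g = np.array([
            (true_reward(a + eps * e) - true_reward(a - eps * e)) / (2 * eps)
            for e in np.eye(dim)
        ])
        # Online learner: one gradient step on
        #   (model_reward - r)^2 + ||model_grad_a - g||^2
        resid_r = model_reward(theta, a) - r
        resid_g = model_grad_a(theta, a) - g
        grad_theta = 4.0 * resid_r * (a - theta) + 4.0 * resid_g
        theta = theta - lr_model * grad_theta
        # Virtual ascent: step along the gradient of the learned model
        a = a + lr_action * model_grad_a(theta, a)
    return a, theta

# Usage on a toy reward whose only local maximum is theta_star
theta_star = np.array([1.0, -2.0])
a, theta = virtual_ascent(lambda x: -np.sum((x - theta_star) ** 2), dim=2)
```

In this toy the reward is concave, so the local maximum the action reaches is also the global one. The learner only needs the model to match the true reward and its gradient at the actions actually played, rather than everywhere.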
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Model Reduction and Neural Networks
