Contextual Bandit Optimization with Pre-Trained Neural Networks
Mikhail Terekhov

TL;DR
This thesis introduces Explore Twice then Commit (E2TC), an algorithm for contextual bandit problems whose rewards are modeled by a neural network. E2TC uses pre-trained weights to obtain sublinear regret with smaller models, and the work supports this with both theoretical and empirical analysis.
Contribution
The work proposes E2TC, an algorithm that uses pre-trained neural network weights to learn efficiently in contextual bandits, and supports it with theoretical regret bounds and empirical evaluation.
Findings
E2TC achieves sublinear regret under certain conditions.
Pre-training improves learning efficiency in neural bandit models.
Experimental results validate the theoretical bounds and explore sample complexity.
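For readers unfamiliar with the term, the following is a minimal sketch of what "sublinear regret" means: cumulative regret grows more slowly than the number of rounds, so average regret per round shrinks. The function name and the reward values are hypothetical and are not taken from the thesis.

```python
from itertools import accumulate

def cumulative_regret(optimal_rewards, received_rewards):
    """Running sum of per-round regret: the best expected reward
    minus the expected reward of the action actually chosen."""
    gaps = [best - got for best, got in zip(optimal_rewards, received_rewards)]
    return list(accumulate(gaps))

# Hypothetical run: the learner explores for 3 rounds, then commits
# to the best action and incurs no further regret.
optimal = [1.0] * 6
received = [0.5, 0.0, 0.5, 1.0, 1.0, 1.0]

regret = cumulative_regret(optimal, received)
average_regret = [r / (t + 1) for t, r in enumerate(regret)]
```

Here the cumulative regret stops growing once the learner commits, so the average regret keeps falling as more rounds are played.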
Abstract
Bandit optimization is a difficult problem, especially if the reward model is high-dimensional. When rewards are modeled by neural networks, sublinear regret has only been shown under strong assumptions, usually when the network is extremely wide. In this thesis, we investigate how pre-training can help us in the regime of smaller models. We consider a stochastic contextual bandit with the rewards modeled by a multi-layer neural network. The last layer is a linear predictor, and the layers before it are a black box neural architecture, which we call a representation network. We model pre-training as an initial guess of the weights of the representation network provided to the learner. To leverage the pre-trained weights, we introduce a novel algorithm we call Explore Twice then Commit (E2TC). During its two stages of exploration, the algorithm first estimates the last layer's weights…
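The reward model described in the abstract, a representation network followed by a linear last layer, can be sketched as below. This is an illustration under stated assumptions, not the thesis's implementation: the one-layer tanh representation, the ridge estimator, the noise levels, and all names (representation, estimate_last_layer, theta_pre) are hypothetical. It covers only the first step the abstract mentions, estimating the last layer's weights while the representation is held at its pre-trained value; the rest of E2TC is not reproduced here.

```python
import numpy as np

def representation(theta, contexts):
    """Stand-in for the black-box representation network.
    Assumption: a single tanh layer; the thesis allows any architecture."""
    return np.tanh(contexts @ theta)

def estimate_last_layer(features, rewards, ridge=1e-3):
    """Ridge least-squares estimate of the linear last layer,
    with the representation features held fixed (an assumed estimator)."""
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ rewards)

rng = np.random.default_rng(0)

# Hypothetical ground truth: representation weights and last-layer weights.
theta_true = rng.normal(size=(4, 3))
w_true = rng.normal(size=3)

# Pre-training modeled as an initial guess close to the true representation.
theta_pre = theta_true + 0.01 * rng.normal(size=(4, 3))

# Exploration rounds: random context-action vectors and noisy rewards.
contexts = rng.normal(size=(500, 4))
rewards = representation(theta_true, contexts) @ w_true + 0.05 * rng.normal(size=500)

# Estimate the last layer using features from the pre-trained representation.
w_hat = estimate_last_layer(representation(theta_pre, contexts), rewards)
```

With the representation fixed, the estimate reduces to a linear regression on the pre-trained features, which is why the quality of the initial guess matters in this sketch.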
