Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits
Yu-Heng Hung, Ping-Chun Hsieh

TL;DR
This paper introduces NeuralRBMLE, a neural network-based approach for stochastic contextual bandits that incorporates reward-biased maximum likelihood estimation to enhance exploration and achieve competitive regret bounds.
Contribution
It adapts the classic RBMLE principle with neural networks for contextual bandits, proposing two algorithms with theoretical regret guarantees and superior empirical performance.
Findings
Both algorithms achieve $\tilde{O}(\sqrt{T})$ regret.
NeuralRBMLE methods outperform state-of-the-art methods on real-world datasets.
The approach encodes exploration directly in neural network parameters.
Abstract
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: the first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\tilde{O}(\sqrt{T})$ regret. Through extensive…
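To make the idea of "adding a bias term to the log-likelihood" concrete, the sketch below works through the reward-biased objective in a much simpler setting than the paper's: a linear reward model with Gaussian noise and ridge regularization stands in for the neural network, so the maximizer has a closed form. The function names (`biased_values`, `rbmle_index`) and the parameters `alpha` (bias weight) and `lam` (regularization) are hypothetical names chosen for illustration, not taken from the paper. Under these assumptions, maximizing the biased objective per arm and ranking arms by a simple index give the same ordering, which mirrors the relationship between the two variants described above.

```python
import numpy as np

def biased_values(X, r, arms, alpha, lam=1.0):
    """For each arm, maximize (log-likelihood + alpha * predicted reward of that arm).

    Assumption: linear-Gaussian model with ridge penalty, so the maximizer
    is available in closed form instead of by gradient ascent.
    """
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)   # regularized Gram matrix
    b = X.T @ r
    values = np.empty(len(arms))
    for i, x_arm in enumerate(arms):
        # Reward-biased estimator for this arm: pulled toward a high x_arm @ theta.
        theta = np.linalg.solve(V, b + alpha * x_arm)
        resid = X @ theta - r
        loglik = -0.5 * resid @ resid - 0.5 * lam * theta @ theta
        values[i] = loglik + alpha * (x_arm @ theta)
    return values

def rbmle_index(X, r, arms, alpha, lam=1.0):
    """Index form: estimated reward plus an exploration bonus, with no confidence interval."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ r)     # ordinary ridge estimate
    V_inv_arms = np.linalg.solve(V, arms.T).T   # rows are V^{-1} x_arm
    bonus = 0.5 * alpha * np.sum(arms * V_inv_arms, axis=1)
    return arms @ theta_hat + bonus

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))      # past contexts (synthetic)
    r = rng.normal(size=50)           # past rewards (synthetic)
    arms = rng.normal(size=(4, 5))    # candidate contexts at the current round
    print("arm chosen by biased objective:", int(np.argmax(biased_values(X, r, arms, alpha=3.0))))
    print("arm chosen by index policy:    ", int(np.argmax(rbmle_index(X, r, arms, alpha=3.0))))
```

In this linear stand-in, the maximized biased objective equals a constant shared by all arms plus `alpha` times the index, so both functions rank arms identically. With a neural reward model no such closed form exists, which is where the gradient-ascent variant and the approximate index variant come in.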
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Age of Information Optimization
