Online Model Selection: a Rested Bandit Formulation
Leonardo Cella, Claudio Gentile, Massimiliano Pontil

TL;DR
This paper introduces a bandit-based formulation of online model selection in which each arm's expected loss decreases with the number of times that arm has been played. It analyzes an arm elimination algorithm that identifies the best arm by exploiting the functional form shared by the arms' expected losses.
Contribution
It formulates a new rested bandit problem for model selection, develops an arm elimination algorithm, and provides theoretical analysis including regret bounds and lower bounds.
Findings
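The regret of the arm elimination algorithm vanishes as the time horizon increases.
The algorithm exploits the assumed functional form of the expected losses, and its rate of convergence depends on that form.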
Lower bounds highlight the limits of the proposed method.
Abstract
Motivated by a natural problem in online model selection with bandit information, we introduce and analyze a best arm identification problem in the rested bandit setting, wherein arm expected losses decrease with the number of times the arm has been played. The shape of the expected loss functions is similar across arms, and is assumed to be available up to unknown parameters that have to be learned on the fly. We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game. We analyze an arm elimination algorithm whose regret vanishes as the time horizon increases. The actual rate of convergence depends in a detailed way on the postulated functional form of the expected losses. Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific…
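The benchmark used in the regret notion above, the arm with the smallest expected loss at the end of the game, can be illustrated with a small sketch. The text here does not give the functional form of the losses, so the shape `alpha / pulls**rho + beta` below is an assumption chosen only for illustration, as are the names `Arm`, `expected_loss`, `benchmark_arm` and the parameter values. The sketch shows one thing only: when losses decrease with plays, the arm worth committing to can change with the time horizon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    # Hypothetical per-arm parameters; unknown to the learner in the real problem.
    alpha: float  # size of the loss component that decays with plays
    beta: float   # loss the arm approaches after many plays


def expected_loss(arm: Arm, pulls: int, rho: float = 0.5) -> float:
    """Assumed loss shape: alpha / pulls**rho + beta, decreasing in pulls."""
    if pulls < 1:
        raise ValueError("pulls must be at least 1")
    return arm.alpha / pulls**rho + arm.beta


def benchmark_arm(arms: list[Arm], horizon: int, rho: float = 0.5) -> int:
    """Index of the arm with the smallest expected loss after `horizon` pulls."""
    return min(range(len(arms)), key=lambda i: expected_loss(arms[i], horizon, rho))


# Arm 0 learns slowly but ends lower; arm 1 starts better but plateaus higher.
arms = [Arm(alpha=1.0, beta=0.10), Arm(alpha=0.2, beta=0.20)]

short_horizon_choice = benchmark_arm(arms, horizon=10)
long_horizon_choice = benchmark_arm(arms, horizon=1000)
print(short_horizon_choice, long_horizon_choice)
```

With these made-up parameters the benchmark arm differs between the short and the long horizon, which is why comparing arms by their losses early in the game can be misleading in a rested setting.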
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
