Regret bounds for Narendra-Shapiro bandit algorithms
Sébastien Gadat, Fabien Panloup, Sofiane Saadane

TL;DR
This paper studies regret bounds for Narendra-Shapiro (NS) bandit algorithms. It shows that the penalized version achieves a sublinear pseudo-regret bound, and it extends convergence results to the multi-armed case, together with an analysis of the associated piecewise deterministic Markov process (PDMP).
Contribution
The paper provides the first regret bounds for penalized Narendra-Shapiro algorithms and extends convergence and ergodic properties to multi-armed bandits with PDMP analysis.
Findings
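Up to an over-penalization modification, the pseudo-regret of the penalized two-armed bandit algorithm is uniformly bounded by C√n, where the constant C is made explicit in the paper.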
Convergence results are extended to multi-armed bandits.
Ergodic properties of the associated PDMP are established.
Abstract
Narendra-Shapiro (NS) algorithms are bandit-type algorithms that were introduced in the sixties (with a view to applications in psychology or learning automata), and whose convergence has been intensively studied in the stochastic algorithm literature. In this paper, we address the following question: are the Narendra-Shapiro (NS) bandit algorithms competitive from a regret point of view? In our main result, we show that some competitive bounds can be obtained for such algorithms in their penalized version (introduced in \cite{Lamberton_Pages}). More precisely, up to an over-penalization modification, the pseudo-regret related to the penalized two-armed bandit algorithm is uniformly bounded by C√n (where C is made explicit in the paper). We also generalize existing convergence and rates of convergence results to the multi-armed case of the…
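To make the object of study concrete, here is a minimal simulation sketch of a penalized NS-type update for two Bernoulli arms. It is an illustration under stated assumptions, not the paper's exact scheme. The function name `penalized_ns_two_armed`, the step size gamma_n = gamma1/√(n+1), the penalty rate rho_n = rho1/√(n+1), and the starting value 0.5 are all hypothetical choices. The over-penalization modification mentioned in the abstract is not implemented. The state x is the probability of playing arm 1. A success on the played arm moves x toward that arm, and a failure moves x away from it at the smaller rate gamma_n·rho_n. The returned sum of gaps is a quantity whose expectation is the pseudo-regret.

```python
import random


def penalized_ns_two_armed(p1, p2, n_steps, gamma1=1.0, rho1=1.0, seed=0):
    """Simulate a penalized NS-type two-armed bandit (illustrative sketch).

    p1, p2: success probabilities of the two Bernoulli arms.
    Returns (x, gap_sum): the final probability of playing arm 1 and the
    cumulated gap between the best arm and the arm actually played.
    """
    rng = random.Random(seed)
    x = 0.5  # assumed starting probability of playing arm 1
    best = max(p1, p2)
    gap_sum = 0.0
    for n in range(1, n_steps + 1):
        gamma = gamma1 / (n + 1) ** 0.5  # assumed step size, stays below 1
        rho = rho1 / (n + 1) ** 0.5      # assumed penalty rate
        play_first = rng.random() < x
        p = p1 if play_first else p2
        success = rng.random() < p
        gap_sum += best - p
        if play_first:
            if success:
                x += gamma * (1.0 - x)        # reward: reinforce arm 1
            else:
                x -= gamma * rho * x          # penalty: move away from arm 1
        else:
            if success:
                x -= gamma * x                # reward: reinforce arm 2
            else:
                x += gamma * rho * (1.0 - x)  # penalty: move away from arm 2
    return x, gap_sum


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        x, gap_sum = penalized_ns_two_armed(0.7, 0.4, n)
        print(n, round(x, 3), round(gap_sum, 1), round(gap_sum / n ** 0.5, 3))
```

Printing the ratio of the cumulated gap to √n for increasing horizons gives a rough empirical feel for a C√n-type bound. A single seeded run is only indicative and does not verify the theorem.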
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Machine Learning and Algorithms
