Learning under Invariable Bayesian Safety
Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, Moshe Tennenholtz

TL;DR
This paper introduces a new safety constraint in explore-and-exploit systems, ensuring each decision maintains a minimum expected value, and proposes an asymptotically optimal algorithm for this setting.
Contribution
It models a safety constraint that must be respected in every round, develops an asymptotically optimal algorithm for this setting, and analyzes the algorithm's instance-dependent convergence rate.
Findings
Proposed an asymptotically optimal safe exploration algorithm.
Analyzed the instance-dependent convergence rate.
Ensured safety constraints are maintained in all rounds.
Abstract
A recent body of work addresses safety constraints in explore-and-exploit systems. Such constraints arise, for example, when exploration is carried out by individuals whose welfare should be balanced against overall welfare. In this paper, we adopt a model inspired by recent work on a bandit-like setting for recommendations. We contribute to this line of literature by introducing a safety constraint that must be respected in every round and requires the expected value in each round to be above a given threshold. Under our model, the safe explore-and-exploit policy requires careful planning; otherwise, it leads to sub-optimal welfare. We devise an asymptotically optimal algorithm for the setting and analyze its instance-dependent convergence rate.
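To make the per-round constraint concrete, here is a minimal single-round sketch. It assumes each arm has a known (posterior) mean, that the system may randomize over arms, and that "safe" means the randomized action's expected value is at least a threshold τ. It then computes the largest probability with which a below-threshold arm can be explored while mixing with an above-threshold arm. The function names and this mixing scheme are hypothetical illustrations of the constraint, not the authors' asymptotically optimal algorithm.

```python
from typing import Dict


def expected_value(policy: Dict[str, float], means: Dict[str, float]) -> float:
    """Expected reward of a randomized action: sum over arms of policy[a] * means[a]."""
    return sum(p * means[a] for a, p in policy.items())


def is_safe(policy: Dict[str, float], means: Dict[str, float],
            threshold: float, tol: float = 1e-12) -> bool:
    """Assumed per-round safety check: the expected value must be >= threshold."""
    if abs(sum(policy.values()) - 1.0) > tol or any(p < 0 for p in policy.values()):
        raise ValueError("policy must be a probability distribution")
    return expected_value(policy, means) >= threshold - tol


def max_explore_prob(mu_safe: float, mu_explore: float, threshold: float) -> float:
    """Largest p with p * mu_explore + (1 - p) * mu_safe >= threshold.

    Hypothetical helper: mixes a known arm (mean mu_safe) with an arm to explore.
    """
    if mu_safe < threshold:
        raise ValueError("the safe arm alone violates the threshold")
    if mu_explore >= threshold:
        return 1.0  # exploring this arm is safe on its own
    # Solve p * mu_explore + (1 - p) * mu_safe = threshold for p.
    return (mu_safe - threshold) / (mu_safe - mu_explore)


if __name__ == "__main__":
    means = {"known": 1.0, "new": 0.5}
    p = max_explore_prob(means["known"], means["new"], threshold=0.75)
    policy = {"known": 1.0 - p, "new": p}
    print(p, is_safe(policy, means, threshold=0.75))
```

With a known arm of mean 1.0, an unexplored arm of mean 0.5, and τ = 0.75, the unexplored arm can be pulled with probability at most 0.5 in that round. Any larger exploration probability pushes the round's expected value below τ. This tension is why a safe policy needs planning across rounds rather than greedy choices.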
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
