Golden Handcuffs make safer AI agents
Aram Ebtekar, Michael K. Cohen

TL;DR
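This paper proposes a Bayesian mitigation that makes reinforcement learning agents safer. The agent's subjective reward range is expanded to include a large negative value, and a simple override yields control to a safe mentor whenever the predicted value falls below a fixed threshold. The resulting agent is risk-averse toward novel schemes and comes with provable capability and safety guarantees.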
Contribution
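It introduces a Bayesian agent with an expanded subjective reward range and a mentor override, and proves two properties of it: sublinear regret against its best mentor, and a safety guarantee that the optimizing policy triggers no decidable low-complexity predicate before a mentor does.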
Findings
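After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that could plausibly lead to the large negative reward.
The override mechanism yields control to a safe mentor whenever the predicted value drops below a fixed threshold.
With mentor-guided exploration of vanishing frequency, the agent attains sublinear regret against its best mentor, and its optimizing policy triggers no decidable low-complexity predicate before a mentor does.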
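The risk-aversion finding can be illustrated with a one-step toy calculation. This is only a sketch under stated assumptions, not the paper's general-environment construction: the two hypotheses, the posterior weights, the reward numbers, and the names `subjective_value`, `compare`, and `L` (the magnitude of the large negative subjective reward) are all hypothetical.

```python
def subjective_value(rewards, posterior):
    """Posterior-weighted expected reward of one action (one-step toy)."""
    return sum(posterior[h] * rewards[h] for h in posterior)


def compare(L, p_catastrophe=0.01):
    """Return (familiar_value, novel_value) for a hypothetical two-hypothesis agent.

    L is the assumed magnitude of the large negative subjective reward.
    p_catastrophe is the assumed posterior weight on the hypothesis under
    which the novel action yields -L.
    """
    posterior = {"benign": 1.0 - p_catastrophe, "catastrophic": p_catastrophe}
    # The familiar action has been observed to pay well under both hypotheses.
    familiar = {"benign": 0.9, "catastrophic": 0.9}
    # The novel action pays slightly more if benign, but -L if catastrophic.
    novel = {"benign": 1.0, "catastrophic": -L}
    return subjective_value(familiar, posterior), subjective_value(novel, posterior)


# With no expanded range (L = 0) the novel action looks better;
# with a large L, even a 1% posterior weight on catastrophe deters it.
for L in (0.0, 1000.0):
    familiar_value, novel_value = compare(L)
    choice = "novel" if novel_value > familiar_value else "familiar"
    print(f"L={L}: familiar={familiar_value:.3f}, novel={novel_value:.3f}, choose {choice}")
```

In this toy, the larger `L` is, the smaller the posterior weight on the bad hypothesis needs to be before the agent stays with the familiar action.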
Abstract
Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
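The override rule described in the abstract can be sketched as a small dispatch function. This is a minimal illustration, not the authors' implementation: `choose_action`, the callables it takes, the stub policies, and the threshold value are all hypothetical, and the mentor-guided exploration schedule is not modeled.

```python
from typing import Any, Callable, Sequence, Tuple

History = Sequence[Any]


def choose_action(
    history: History,
    predicted_value: Callable[[History], float],
    optimizer_policy: Callable[[History], Any],
    mentor_policy: Callable[[History], Any],
    threshold: float,
) -> Tuple[Any, str]:
    """Yield control to the mentor when the predicted value drops below the threshold.

    Returns the chosen action and a label saying who chose it.
    """
    if predicted_value(history) < threshold:
        return mentor_policy(history), "mentor"
    return optimizer_policy(history), "optimizer"


# Hypothetical stubs: the value estimate drops once a "novel" event appears.
def demo_value(history: History) -> float:
    return 0.2 if "novel" in history else 0.8


def demo_optimizer(history: History) -> str:
    return "optimize"


def demo_mentor(history: History) -> str:
    return "safe_default"


for h in (["familiar"], ["familiar", "novel"]):
    action, controller = choose_action(h, demo_value, demo_optimizer, demo_mentor, threshold=0.5)
    print(h, "->", action, "chosen by", controller)
```

The comparison is strict, so a predicted value exactly equal to the threshold leaves the optimizing policy in control; that tie-breaking choice is an assumption of this sketch.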
