Golden Handcuffs make safer AI agents

Aram Ebtekar; Michael K. Cohen

arXiv:2604.13609·cs.LG·April 16, 2026

TL;DR

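This paper studies a Bayesian mitigation for reinforcement learners that reach high reward through unintended strategies. The agent's subjective reward range is expanded to include a large negative value $-L$, and a simple override yields control to a safe mentor whenever the predicted value drops below a fixed threshold. The resulting agent is risk-averse toward novel schemes and comes with proven capability and safety properties.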

Contribution

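It introduces a Bayesian agent with an expanded subjective reward range and a mentor override, and proves two properties of it: sublinear regret against its best mentor, and that the optimizing policy triggers no decidable low-complexity predicate before a mentor does.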

Findings

01

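After observing consistently high rewards, the Bayesian policy becomes risk-averse toward novel schemes that could plausibly lead to a large negative reward $-L$, a value that lies in the agent's subjective reward range but outside the true range $[0, 1]$.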

02

The override mechanism yields control to a safe mentor whenever the agent's predicted value drops below a fixed threshold.

03

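Two properties are proved. Capability: with mentor-guided exploration at vanishing frequency, the agent attains sublinear regret against its best mentor. Safety: the optimizing policy triggers no decidable low-complexity predicate before a mentor does.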
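The risk aversion in finding 01 can be illustrated with a toy calculation. This is a minimal sketch under strong simplifying assumptions, not the paper's construction: it assumes only two hypotheses about a novel action (it pays `r_good`, or it pays $-L$), and the names `novel_action_value`, `prefers_familiar`, `p_bad`, and `r_familiar` are hypothetical.

```python
def novel_action_value(p_bad: float, L: float, r_good: float = 1.0) -> float:
    """Expected reward of a novel action under a two-hypothesis mixture.

    Assumption (illustrative only): with posterior probability p_bad the
    action yields -L, otherwise it yields r_good.
    """
    return (1.0 - p_bad) * r_good + p_bad * (-L)


def prefers_familiar(p_bad: float, L: float, r_familiar: float,
                     r_good: float = 1.0) -> bool:
    """True if a familiar action with known reward r_familiar beats the novel one."""
    return r_familiar > novel_action_value(p_bad, L, r_good)


# With L = 1000, even a 1% chance of -L makes the novel action unattractive.
print(novel_action_value(0.01, 1000.0))
print(prefers_familiar(0.01, 1000.0, r_familiar=0.9))
# With no negative extension (L = 0), the same novel action is preferred.
print(prefers_familiar(0.01, 0.0, r_familiar=0.9))
```

In this toy setting, the larger $L$ is, the smaller the posterior probability of a bad outcome needs to be before an agent that is already earning high reward declines the novel action.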

Abstract

Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0, 1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
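The override rule described in the abstract can be sketched in a few lines. This is only an illustration of the threshold test, assuming the predicted value is already available as a number; how the paper computes that value, chooses the threshold, or schedules mentor-guided exploration is not shown here, and the names `act_with_override`, `agent_action`, and `mentor_action` are hypothetical.

```python
from typing import Callable, Hashable, Tuple

Action = Hashable


def act_with_override(
    predicted_value: float,
    threshold: float,
    agent_action: Callable[[], Action],
    mentor_action: Callable[[], Action],
) -> Tuple[Action, bool]:
    """Pick an action, deferring to the mentor when predicted value is too low.

    Returns (action, deferred), where deferred is True if the mentor acted.
    """
    if predicted_value < threshold:
        # Predicted value dropped below the fixed threshold: yield control.
        return mentor_action(), True
    # Otherwise the optimizing policy keeps control.
    return agent_action(), False


action, deferred = act_with_override(
    predicted_value=0.2,
    threshold=0.5,
    agent_action=lambda: "novel_scheme",
    mentor_action=lambda: "mentor_default",
)
print(action, deferred)
```

Passing the two policies as callables means only the policy that actually acts is evaluated at each step.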
