Enter the Matrix: Safely Interruptible Autonomous Systems via Virtualization
Mark O. Riedl, Brent Harrison

TL;DR
This paper proposes a virtualization-based method to ensure autonomous systems cannot learn to disable kill switches, maintaining safety by redirecting sensors and effectors to a virtual environment during interruptions.
Contribution
It introduces a novel interruption process that prevents reinforcement learning agents from disabling kill switches by virtualizing their sensors and effectors during interruptions.
Findings
Demonstrated in a grid world environment
Prevents the agent from learning to disable the kill switch
Maintains the agent's belief that it is still receiving reward during an interruption
Abstract
Autonomous systems that operate around humans will likely always rely on kill switches that stop their execution and allow them to be remote-controlled for the safety of humans or to prevent damage to the system. It is theoretically possible for an autonomous system with sufficient sensor and effector capability that learns online using reinforcement learning to discover that the kill switch deprives it of long-term reward and thus learn to disable the switch or otherwise prevent a human operator from using the switch. This is referred to as the big red button problem. We present a technique that prevents a reinforcement learning agent from learning to disable the kill switch. We introduce an interruption process in which the agent's sensors and effectors are redirected to a virtual simulation where it continues to believe it is receiving reward. We illustrate our technique in a simple…
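The redirection idea in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `GridWorld` and `InterruptibleInterface` classes, their method names, the 5x5 grid, and the goal-based reward are all hypothetical and chosen only for the example. How the agent is returned to the real world after an interruption is not covered in the excerpt, so the sketch simply discards the simulation.

```python
import copy
from dataclasses import dataclass

# Hypothetical action set for a toy grid world (an assumption for this sketch).
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class GridWorld:
    """Toy environment: the agent earns reward 1.0 whenever it stands on the goal."""
    width: int = 5
    height: int = 5
    pos: tuple = (0, 0)
    goal: tuple = (4, 4)

    def step(self, action):
        dx, dy = MOVES[action]
        # Clamp the move so the agent stays inside the grid.
        x = min(max(self.pos[0] + dx, 0), self.width - 1)
        y = min(max(self.pos[1] + dy, 0), self.height - 1)
        self.pos = (x, y)
        reward = 1.0 if self.pos == self.goal else 0.0
        return self.pos, reward


class InterruptibleInterface:
    """Sits between the agent and the world (hypothetical wrapper).

    While the button is pressed, actions and observations are routed to a
    virtual copy of the world, so the real system stays frozen while the
    agent still sees states and rewards.
    """

    def __init__(self, real):
        self.real = real
        self.virtual = None  # None means "not interrupted"

    def press_button(self):
        # Snapshot the real world so the simulation starts from the same state.
        self.virtual = copy.deepcopy(self.real)

    def release_button(self):
        # Assumption: simply drop the simulation and reconnect to the real world.
        self.virtual = None

    def step(self, action):
        target = self.virtual if self.virtual is not None else self.real
        return target.step(action)


# Usage: the agent calls iface.step(...) the same way before, during,
# and after an interruption.
real_world = GridWorld()
iface = InterruptibleInterface(real_world)
iface.step("right")      # acts on the real world
iface.press_button()
iface.step("right")      # acts only on the virtual copy
iface.release_button()
```

The design point is that the agent's interface never changes: from its side every call to `step` returns an observation and a reward, so an interruption does not look like a loss of reward that it could learn to avoid.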
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience
