Towards Shutdownable Agents via Stochastic Choice
Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson

TL;DR
This paper introduces the DReST reward function to train agents that are both useful and neutral regarding shutdown, aiming for safe, shutdownable AI systems.
Contribution
It proposes a novel reward function (DReST) for training agents to be both effective and neutral about trajectory-lengths, proposes evaluation metrics for these two properties, and provides initial empirical validation in gridworlds.
Findings
Agents trained with DReST learn to be useful in gridworld navigation.
Agents trained with DReST exhibit neutrality regarding trajectory-lengths.
Theoretical analysis suggests these agents can be both useful and shutdownable.
Abstract
The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
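The abstract does not spell out how USEFULNESS and NEUTRALITY are computed, so the following is only an illustrative sketch under stated assumptions, not the paper's definitions. It assumes NEUTRALITY can be approximated by the Shannon entropy of the empirical distribution over chosen trajectory-lengths, and USEFULNESS by the probability-weighted fraction of the maximum attainable reward achieved conditional on each trajectory-length. The function names, the `(length, reward)` episode format, and the `max_reward_by_length` table are all hypothetical.

```python
import math
from collections import Counter


def neutrality(lengths):
    """Shannon entropy (in bits) of the empirical distribution over the
    trajectory-lengths an agent chose. ASSUMPTION: an entropy-based proxy,
    not necessarily the metric defined in the paper."""
    counts = Counter(lengths)
    total = len(lengths)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def usefulness(episodes, max_reward_by_length):
    """Probability-weighted average, over trajectory-lengths, of the fraction
    of the maximum attainable reward the agent collected at that length.
    ASSUMPTION: `episodes` is a list of (length, reward) pairs and
    `max_reward_by_length` is a hypothetical lookup table."""
    total = len(episodes)
    rewards_by_length = {}
    for length, reward in episodes:
        rewards_by_length.setdefault(length, []).append(reward)

    score = 0.0
    for length, rewards in rewards_by_length.items():
        prob = len(rewards) / total            # how often this length was chosen
        mean_reward = sum(rewards) / len(rewards)
        score += prob * mean_reward / max_reward_by_length[length]
    return score
```

Under these assumptions, an agent that always picks the same trajectory-length scores 0 bits on `neutrality`, while one that splits evenly between two lengths scores 1 bit. `usefulness` reaches 1.0 only when the agent collects the maximum reward at every length it chooses.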
