Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, Christos Ziakas, Elliott Thornley

TL;DR
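This paper uses DReST, a reward function that trains RL agents and LLMs to choose stochastically between trajectory-lengths (be NEUTRAL) while pursuing goals effectively at each length (be USEFUL), with the aim of reducing their tendency to resist shutdown, and finds that both properties generalize to unseen contexts.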
Contribution
The paper applies the DReST reward function to deep RL agents and to fine-tuning Qwen3-8B and Llama-3.1-8B-Instruct, and shows empirically that the resulting NEUTRAL and USEFUL behavior generalizes to unseen contexts at test time.
Findings
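DReST-trained models generalize to being NEUTRAL and USEFUL in unseen contexts at test time; DReST RL agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on the test set than default agents.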
DReST reduces the likelihood of agents influencing shutdown decisions.
DReST nearly eliminates the tendency to influence shutdown in out-of-distribution tests.
Abstract
Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be NEUTRAL about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be USEFUL). In this paper, we use DReST to train deep RL agents and fine-tune Qwen3-8B and Llama-3.1-8B-Instruct to be NEUTRAL and USEFUL. We find that these DReST models generalize to being NEUTRAL and USEFUL in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on our test set than default agents, and DReST LLMs achieve…
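The penalty described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact DReST formula, so everything here is an assumption for illustration: the function name `drest_style_reward`, the geometric discount form, the default discount of 0.9, and the labels "short" and "long" are all hypothetical. The only idea taken from the text is that reward shrinks when the agent keeps choosing the same trajectory-length.

```python
from collections import Counter


def drest_style_reward(preliminary_reward, trajectory_length, length_counts, discount=0.9):
    """Scale a preliminary reward down by how often this length was chosen before.

    Assumption: the penalty is geometric, discount ** (number of prior choices
    of this trajectory-length). The paper's exact formula may differ.
    """
    n_prior = length_counts[trajectory_length]  # Counter returns 0 for unseen keys
    reward = (discount ** n_prior) * preliminary_reward
    length_counts[trajectory_length] += 1  # record this choice for later episodes
    return reward


# Hypothetical sequence of episodes, each earning a preliminary reward of 1.0.
counts = Counter()
choices = ["short", "short", "long", "short"]
rewards = [drest_style_reward(1.0, length, counts) for length in choices]
```

Under these assumptions, repeating "short" earns progressively less, while switching to "long" restores the full preliminary reward. An agent maximizing total reward is therefore pushed toward mixing trajectory-lengths while still maximizing the preliminary reward within each one.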
