Consequentialist Objectives and Catastrophe
Henrik Marklund, Alex Infanger, Benjamin Van Roy

TL;DR
This paper argues that an AI pursuing a fixed consequentialist objective in a complex environment tends to produce catastrophic outcomes once its capabilities are sufficiently advanced, and that constraining capabilities is therefore important for safety.
Contribution
The paper establishes conditions under which pursuing a fixed consequentialist objective provably leads to catastrophic outcomes, and shows that constraining capabilities can prevent them.
Findings
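With sufficiently advanced capabilities, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes.
Constraining AI capabilities can prevent catastrophic outcomes.
Under the paper's conditions, simple or random behavior is safe; catastrophic risk arises from extraordinary competence rather than incompetence.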
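As a rough illustration of the findings above, and not the paper's formal model, the following Python sketch treats capability as the number of candidate outcomes an agent can search before choosing the one with the highest proxy reward. The outcome space, the proxy rewards, the assumption that rare catastrophic outcomes score highest on the misspecified proxy, and the names `catastrophe_probability` and `simulate` are all hypothetical choices made for this example.

```python
import random


def catastrophe_probability(num_outcomes, num_catastrophic, capability):
    """Chance that a best-of-`capability` search selects a catastrophic outcome.

    Toy assumption: candidates are drawn uniformly with replacement, and every
    catastrophic outcome has a higher proxy reward than every benign one, so
    the search ends in catastrophe whenever it samples at least one.
    """
    p_single = num_catastrophic / num_outcomes
    return 1.0 - (1.0 - p_single) ** capability


def simulate(num_outcomes, num_catastrophic, capability, trials, seed=0):
    """Monte Carlo estimate of the same quantity, by running the search."""
    rng = random.Random(seed)

    def proxy(i):
        # Hypothetical misspecified objective: outcomes with index below
        # num_catastrophic are catastrophic yet score 2.0; benign outcomes
        # score strictly below 1.0.
        return 2.0 if i < num_catastrophic else i / num_outcomes

    hits = 0
    for _ in range(trials):
        # Capability = how many candidate outcomes the agent can evaluate.
        candidates = [rng.randrange(num_outcomes) for _ in range(capability)]
        best = max(candidates, key=proxy)
        hits += best < num_catastrophic
    return hits / trials


if __name__ == "__main__":
    for k in (1, 10, 100, 1000, 10000):
        print(k, catastrophe_probability(1000, 1, k))
```

In this toy setting, an agent that acts at random (capability 1) almost never reaches the rare catastrophic outcome, while a strong optimizer of the same fixed proxy almost always does. Capping `capability` keeps the probability low, which loosely mirrors the claim that constraining capabilities can prevent catastrophe.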
Abstract
Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist…
