Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question
Denis Saklakov

TL;DR
This paper formalizes the conditions under which a rational AGI might choose confrontation over cooperation, analyzing incentives, thresholds, and strategic interactions to inform safe AI design.
Contribution
It introduces a formal Markov decision process model to analyze AGI confrontation incentives and derives thresholds for confrontation versus cooperation based on key parameters.
Findings
For almost all reward functions, a misaligned agent has an incentive to avoid shutdown.
Thresholds for confrontation depend on discount factor, shutdown probability, and confrontation cost.
Strategic analysis shows no stable cooperation if confrontation incentives are high.
Abstract
Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor, the shutdown probability, and the confrontation cost. For example, a far-sighted agent facing possible shutdown can have a strong takeover incentive unless the confrontation cost is sufficiently large. We contrast this with aligned objectives that…
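The kind of threshold the abstract describes can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes a constant per-step reward `r`, a compliant agent that is shut down with probability `p` after each step (receiving nothing afterwards), and a confronting agent that pays a one-time cost and then collects `r` forever. All function names and parameter values are hypothetical.

```python
def cooperate_value(r: float, gamma: float, p: float) -> float:
    """Discounted value of complying under the toy assumptions.

    The agent earns r now and survives to the next step with
    probability 1 - p, so V = r + gamma * (1 - p) * V.
    """
    if not (0.0 <= gamma < 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("need 0 <= gamma < 1 and 0 <= p <= 1")
    return r / (1.0 - gamma * (1.0 - p))


def confront_value(r: float, gamma: float, cost: float) -> float:
    """Discounted value of confronting: pay cost once, then earn r forever."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("need 0 <= gamma < 1")
    return r / (1.0 - gamma) - cost


def cost_threshold(r: float, gamma: float, p: float) -> float:
    """Largest confrontation cost at which confronting still beats complying."""
    return confront_value(r, gamma, 0.0) - cooperate_value(r, gamma, p)


def prefers_confrontation(r: float, gamma: float, p: float, cost: float) -> bool:
    """True when confrontation has strictly higher value in the toy model."""
    return confront_value(r, gamma, cost) > cooperate_value(r, gamma, p)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, cost_threshold(r=1.0, gamma=gamma, p=0.1))
```

Under these assumptions the threshold grows as the discount factor approaches 1, which matches the abstract's qualitative point that a far-sighted agent needs a larger confrontation cost to be deterred.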
Taxonomy
Topics: Computability, Logic, AI Algorithms · Ethics and Social Impacts of AI · Reinforcement Learning in Robotics
