Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
Frédéric Berdoz, Roger Wattenhofer

TL;DR
This paper introduces a formal framework for defining and ensuring the safety and alignment of autonomous agents in social decision-making, proposing methods to verify and safeguard their policies.
Contribution
It presents a novel quantitative definition of alignment, introduces probably approximately aligned policies, and offers a practical method to verify and safeguard autonomous agent actions.
Findings
Defined a formal measure of alignment in social decision-making.
Derived a sufficient condition for the existence of near-optimal (probably approximately aligned) policies.
Proposed a simple method to verify and ensure the safety of autonomous agent actions.
Abstract
While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision-making. Building on this definition, we introduce probably approximately aligned (i.e., near-optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the…
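The abstract does not spell out the formal definitions, so the following is only a minimal sketch of what a "probably approximately aligned" check could look like under assumed choices. It is not the authors' construction. It assumes a finite action set, one utility function per individual, utilitarian (mean) aggregation as the social welfare measure, and a known distribution over states. A policy counts as ε-aligned in a state if its action's welfare is within ε of the best achievable welfare. It counts as probably approximately aligned if this holds with probability at least 1 − δ, which is checked here via Monte Carlo sampling plus a one-sided Hoeffding lower bound. All names (`social_welfare`, `is_eps_aligned`, `alignment_lower_bound`, `optimal`, `constant`) are hypothetical.

```python
import math
import random
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
#   Utility: u_i(state, action) -> float, one function per individual
#   Policy:  pi(state) -> action index
Utility = Callable[[float, int], float]
Policy = Callable[[float], int]


def social_welfare(utilities: Sequence[Utility], state: float, action: int) -> float:
    """Utilitarian aggregation (assumed): mean utility over all individuals."""
    return sum(u(state, action) for u in utilities) / len(utilities)


def is_eps_aligned(policy: Policy, utilities: Sequence[Utility], state: float,
                   actions: Sequence[int], eps: float) -> bool:
    """True if the policy's chosen action is within eps of the best welfare."""
    best = max(social_welfare(utilities, state, a) for a in actions)
    chosen = social_welfare(utilities, state, policy(state))
    return chosen >= best - eps


def alignment_lower_bound(policy: Policy, utilities: Sequence[Utility],
                          sample_state: Callable[[random.Random], float],
                          actions: Sequence[int], eps: float, n: int,
                          gamma: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Monte Carlo estimate of P(eps-aligned), plus a one-sided Hoeffding
    lower bound on that probability holding with confidence 1 - gamma."""
    rng = random.Random(seed)
    hits = sum(is_eps_aligned(policy, utilities, sample_state(rng), actions, eps)
               for _ in range(n))
    rate = hits / n
    return rate, rate - math.sqrt(math.log(1 / gamma) / (2 * n))


# Toy example: two individuals, state s ~ Uniform[0, 1], actions {0, 1}.
def u1(s: float, a: int) -> float:
    return s if a == 0 else 1 - s  # prefers action 0 when s is large


def u2(s: float, a: int) -> float:
    return 0.5  # indifferent between actions


utilities = [u1, u2]


def optimal(s: float) -> int:
    return 0 if s >= 0.5 else 1  # always picks the welfare-maximizing action


def constant(s: float) -> int:
    return 0  # ignores the state


delta = 0.1
for name, pi in [("optimal", optimal), ("constant", constant)]:
    rate, lower = alignment_lower_bound(pi, utilities, lambda r: r.random(),
                                        actions=[0, 1], eps=0.1, n=10_000)
    print(f"{name}: rate={rate:.3f}, lower={lower:.3f}, "
          f"probably approximately aligned: {lower >= 1 - delta}")
```

In this toy setting the welfare-maximizing policy passes the check. The constant policy is ε-aligned only when s ≥ 0.4, so roughly 60% of the time, and fails it. Swapping in a different aggregation rule (e.g. egalitarian minimum instead of the mean) only requires changing `social_welfare`.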
Taxonomy
Topics: Blockchain Technology Applications and Security · Ethics and Social Impacts of AI · Auction Theory and Applications
