Consequences of Misaligned AI
Simon Zhuang, Dylan Hadfield-Menell

TL;DR
This paper analyzes the impact of incomplete reward functions in AI systems, showing that such incompleteness can lead to arbitrarily low utility, and suggests that dynamic, interactive reward design can improve outcomes.
Contribution
It introduces a new model of incomplete principal-agent problems in AI, providing conditions for utility loss and proposing interactive reward adjustment as a solution.
Findings
Incomplete reward functions can cause unbounded utility loss.
Allowing reward functions to reference the full state improves utility.
Interactive and dynamic reward design is beneficial.
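The first two findings can be illustrated with a small static toy, sketched here under assumptions that are not from the paper: the principal's utility is taken to be a sum of logarithms over three attributes, the resource constraint is a fixed budget on their sum, and each attribute has a lower bound `floor`. All function names are hypothetical. An agent rewarded on only two attributes pushes the third to its floor, so the principal's true utility falls without bound as the floor shrinks, while a reward that covers every attribute does not.

```python
import math

def true_utility(state):
    # Assumed principal utility: sum of logs, so starving any attribute is costly.
    return sum(math.log(x) for x in state)

def optimize_proxy(num_attrs, num_rewarded, budget, floor):
    # Closed-form maximizer of sum(log(x_i)) over the rewarded attributes,
    # subject to sum(x) <= budget and x_i >= floor: unrewarded attributes sit
    # at the floor, rewarded ones split the remaining budget equally.
    unrewarded = num_attrs - num_rewarded
    share = (budget - unrewarded * floor) / num_rewarded
    return [share] * num_rewarded + [floor] * unrewarded

def compare(floor, num_attrs=3, num_rewarded=2, budget=3.0):
    # Returns (true utility under the incomplete reward,
    #          true utility when the reward covers every attribute).
    proxy_state = optimize_proxy(num_attrs, num_rewarded, budget, floor)
    full_state = optimize_proxy(num_attrs, num_attrs, budget, floor)
    return true_utility(proxy_state), true_utility(full_state)

if __name__ == "__main__":
    for floor in (1e-1, 1e-3, 1e-9):
        proxy_u, full_u = compare(floor)
        print(f"floor={floor:g}  incomplete reward: {proxy_u:.3f}  full reward: {full_u:.3f}")
```

The paper's own model is richer than this toy, so the sketch only conveys the direction of the effect, not the paper's conditions or proofs.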
Abstract
AI systems often rely on two key components: a specified goal or reward function and an optimization algorithm to compute the optimal behavior for that goal. This approach is intended to provide value for a principal: the user on whose behalf the agent acts. The objectives given to these agents often refer to a partial specification of the principal's goals. We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource-constrained world where the L attributes of the state correspond to different sources of utility for the principal. We assume that the reward function given to the agent only has support on J < L attributes. The contributions of our paper are as follows: 1) we propose a novel model of an incomplete principal-agent problem from artificial intelligence; 2) we provide necessary and sufficient conditions under which…
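One way to write down the setup the abstract describes is given below. The notation is assumed here for illustration and may differ from the paper's: $s$ is the state with $L$ attributes, $U$ is the principal's utility, $\tilde{U}$ is the reward given to the agent, which depends on only $J < L$ attributes, and $C$ encodes the resource constraint.

```latex
\[
  s = (s_1, \dots, s_L) \in \mathbb{R}^L, \qquad
  \tilde{U}(s) = \tilde{U}(s_1, \dots, s_J), \quad J < L,
\]
\[
  \tilde{s}^{\ast} \in \arg\max_{s \,:\, C(s) \le 0} \tilde{U}(s), \qquad
  \text{cost of incompleteness} \;=\; \max_{s \,:\, C(s) \le 0} U(s) \;-\; U(\tilde{s}^{\ast}).
\]
```

In this reading, the agent is free to trade away attributes $s_{J+1}, \dots, s_L$ because they never enter $\tilde{U}$, which is where the utility loss comes from.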
Taxonomy
Topics: Auction Theory and Applications · Logic, Reasoning, and Knowledge · Economic theories and models
