TL;DR
This paper introduces a framework based on causal influence diagrams for analyzing agent incentives. It provides graphical criteria for value of control, response incentives and instrumental control incentives, which can aid AI safety and fairness evaluations.
Contribution
It develops a causal influence diagram framework for incentive analysis, with sound and complete graphical criteria for value of control, response incentives and instrumental control incentives.
Findings
Established the completeness of a well-known graphical criterion for value of information.
Proposed a new graphical criterion for value of control with proven soundness and completeness.
Introduced response incentives and instrumental control incentives, each with sound and complete graphical criteria.
Abstract
We present a framework for analysing agent incentives using causal influence diagrams. We establish that a well-known criterion for value of information is complete. We propose a new graphical criterion for value of control, establishing its soundness and completeness. We also introduce two new concepts for incentive analysis: response incentives indicate which changes in the environment affect an optimal decision, while instrumental control incentives establish whether an agent can influence its utility via a variable X. For both new concepts, we provide sound and complete graphical criteria. We show by example how these results can help with evaluating the safety and fairness of an AI system.
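The remaining criteria can be sketched with d-separation and a pruning step. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a single-decision diagram encoded as a dict mapping every node to its set of children. An observation W of D is treated as requisite unless W is d-separated from the utility nodes downstream of D given D and D's other parents. As our reading of the paper, it further assumes three things: X has positive value of information when X would be requisite after adding an edge X → D; response incentives correspond to a directed path X ⇢ D; and value of control corresponds to a directed path X ⇢ U. Both path checks are made in the graph with information links from nonrequisite observations removed. d-separation is tested with the standard moralized ancestral graph method. All names and the toy diagram are hypothetical.

```python
from collections import deque

# Hypothetical encoding: every node maps to its set of children (leaves -> set()).

def parents(graph, node):
    return {p for p, children in graph.items() if node in children}

def descendants(graph, node):
    """Nodes reachable from `node` by a directed path of length >= 1."""
    seen, stack = set(), [node]
    while stack:
        for child in graph[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def ancestral_set(graph, nodes):
    """`nodes` together with all of their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(graph, stack.pop()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(graph, xs, ys, zs):
    """Test xs ⊥ ys | zs via separation in the moralized ancestral graph."""
    keep = ancestral_set(graph, set(xs) | set(ys) | set(zs))
    moral = {v: set() for v in keep}
    for v in keep:
        ps = parents(graph, v)  # all in `keep` by ancestral closure
        for p in ps:  # original edges, now undirected
            moral[v].add(p)
            moral[p].add(v)
        for a in ps:  # "marry" co-parents
            moral[a] |= ps - {a}
    # Breadth-first search from xs that never enters the conditioning set.
    frontier = deque(x for x in xs if x not in zs)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in ys:
            return False
        for w in moral[v] - seen - set(zs):
            seen.add(w)
            frontier.append(w)
    return True

def is_requisite(graph, decision, utilities, obs):
    """Assumed test: obs is not d-separated from downstream utilities."""
    u_d = utilities & descendants(graph, decision)
    fa_d = parents(graph, decision) | {decision}
    return not d_separated(graph, {obs}, u_d, fa_d - {obs})

def minimal_reduction(graph, decision, utilities):
    """Copy of graph without links from nonrequisite observations to D."""
    reduced = {v: set(cs) for v, cs in graph.items()}
    for w in parents(graph, decision):
        if not is_requisite(graph, decision, utilities, w):
            reduced[w].discard(decision)
    return reduced

def value_of_information(graph, decision, utilities, x):
    """Assumes x is not a descendant of the decision."""
    extended = {v: set(cs) for v, cs in graph.items()}
    extended[x].add(decision)  # hypothetically let the agent observe x
    return is_requisite(extended, decision, utilities, x)

def response_incentive(graph, decision, utilities, x):
    return decision in descendants(minimal_reduction(graph, decision, utilities), x)

def value_of_control(graph, decision, utilities, x):
    reduced = minimal_reduction(graph, decision, utilities)
    return bool(descendants(reduced, x) & utilities)

# Toy diagram (hypothetical): hidden state S, noisy observation O of S,
# irrelevant signal N, decision D, utility U.
cid = {"S": {"O", "U"}, "O": {"D"}, "N": {"D"}, "D": {"U"}, "U": set()}
utils = {"U"}
for node in ["S", "O", "N"]:
    print(node,
          value_of_information(cid, "D", utils, node),
          response_incentive(cid, "D", utils, node),
          value_of_control(cid, "D", utils, node))
```

In this toy diagram N feeds into D but carries no information about U, so its link to D is pruned and all three checks come out False for N. S and O keep a route to both D and U and pass all three.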