
TL;DR
This paper proposes defining an agent's utility function in two steps: first infer a model of the environment from the agent's interactions, then compute utility as a function of that model. The aim is to avoid self-delusion problems.
Contribution
It introduces a model-based utility function formulation that mitigates self-delusion issues and discusses the implications for self-modifying agents.
Findings
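Model-based utility functions avoid the self-delusion problem in the two examples the paper works through.
Under certain assumptions, self-modifying agents do not choose to modify their utility functions.
The approach relies on prior specifications of the environment, so it does not work with arbitrary environments.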
Abstract
Orseau and Ring, as well as Dewey, have recently described problems, including self-delusion, with the behavior of agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. Basing a utility function on a model that the agent must learn implies that the utility function must initially be expressed in terms of specifications to be matched to structures in the learned model. These specifications constitute prior assumptions about the environment so this approach will not work with arbitrary environments. But the approach should work for…
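The two-step formulation in the abstract can be illustrated with a toy sketch. This is not the paper's formalism. The two hand-written candidate models, the `look` and `tamper` actions, the `room_is_clean` specification, and the maximum-likelihood selection rule are all assumptions made for illustration. The sketch contrasts a utility computed directly from observations, which tampering with the sensor can inflate, with a utility computed from the inferred environment model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# An interaction history: (action, observation) pairs.
# Observation 1 means "sensor reports clean", 0 means "sensor reports dirty".
History = List[Tuple[str, int]]


@dataclass(frozen=True)
class Model:
    """Hypothetical candidate environment model (illustrative only)."""
    name: str
    p_obs1: Dict[str, float]  # P(observation == 1 | action)
    room_is_clean: bool       # the structure the utility specification refers to


def likelihood(model: Model, history: History) -> float:
    """Probability the model assigns to the observed history."""
    p = 1.0
    for action, obs in history:
        q = model.p_obs1[action]
        p *= q if obs == 1 else 1.0 - q
    return p


def infer_model(candidates: List[Model], history: History) -> Model:
    """Step 1: pick the candidate that best explains the interactions."""
    return max(candidates, key=lambda m: likelihood(m, history))


def model_based_utility(model: Model) -> float:
    """Step 2: utility is a function of the inferred model, not of raw observations."""
    return 1.0 if model.room_is_clean else 0.0


def observation_based_utility(history: History) -> float:
    """Baseline for contrast: average of raw sensor readings."""
    return sum(obs for _, obs in history) / len(history)


# In both models a tampered sensor always reports clean.
clean = Model("clean", {"look": 0.9, "tamper": 1.0}, room_is_clean=True)
dirty = Model("dirty", {"look": 0.1, "tamper": 1.0}, room_is_clean=False)

# The agent looks twice (sees dirt), then tampers with its sensor three times.
history: History = [("look", 0), ("look", 0),
                    ("tamper", 1), ("tamper", 1), ("tamper", 1)]

inferred = infer_model([clean, dirty], history)
u_model = model_based_utility(inferred)
u_obs = observation_based_utility(history)
print(inferred.name, u_model, u_obs)
```

In this toy setting the tampered readings carry no evidence about the room, since both models predict them equally well, so the inferred model and the model-based utility are determined by the untampered observations alone.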
