Performance of Bounded-Rational Agents With the Ability to Self-Modify
Jakub Tětek, Marek Sklenka, Tomáš Gavenčiak

TL;DR
This paper investigates how bounded rationality in agents affects the risks of self-modification, revealing that certain imperfections can lead to exponential performance deterioration and misalignment over time.
Contribution
It demonstrates that, unlike perfectly rational agents, bounded-rational agents may suffer an exponential decline in performance from self-modification, depending on the type of imperfection.
Findings
Self-modification can cause exponential deterioration in bounded-rational agents.
Imperfections like suboptimal choices can lead to increasing misalignment over time.
Other types of bounded rationality do not necessarily cause worsening over time.
Abstract
Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals. Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss model assumptions…
Taxonomy
Topics: Reinforcement Learning in Robotics · Auction Theory and Applications · Optimization and Search Problems
