# Conservative Agency via Attainable Utility Preservation

**Authors:** Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli

arXiv: 1902.09725 · 2020-06-11

## TL;DR

This paper addresses the risk that an agent pursuing a misspecified reward function irreversibly changes its environment. The proposed approach balances optimization of the primary reward function with preservation of the agent's ability to optimize auxiliary reward functions, which induces conservative yet effective behavior.

## Contribution

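The paper introduces an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. The approach induces conservative, effective behavior even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function.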

## Key findings

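- Conservative, effective behavior emerges even when the auxiliary reward functions are randomly generated.
- Preserving the ability to optimize auxiliary reward functions mitigates the risk of irreversible changes caused by a misspecified primary reward function.
- The auxiliary reward functions do not need to be informative about the correctly specified reward function.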

## Abstract

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.
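The balance described in the abstract can be illustrated with a minimal sketch. It assumes the penalty is the summed absolute change in each auxiliary reward function's action value relative to a no-op action, weighted by a trade-off coefficient. This summary does not state the exact formula, so the function name `aup_reward`, the parameters `lam` and `scale`, and all numbers below are illustrative assumptions rather than the authors' implementation.

```python
def aup_reward(primary_reward, q_aux_action, q_aux_noop, lam=1.0, scale=1.0):
    """Return the primary reward minus a penalty for shifting auxiliary values.

    q_aux_action: assumed action values of each auxiliary reward function
                  after taking the candidate action.
    q_aux_noop:   assumed action values of the same functions after doing nothing.
    lam, scale:   hypothetical trade-off coefficient and normalizer.
    """
    if len(q_aux_action) != len(q_aux_noop):
        raise ValueError("auxiliary value lists must have the same length")
    # Penalize any change (gain or loss) in the ability to optimize each auxiliary reward.
    penalty = sum(abs(a - n) for a, n in zip(q_aux_action, q_aux_noop))
    return primary_reward - (lam / scale) * penalty


# Hypothetical values for two auxiliary reward functions.
noop_values = [0.5, 0.2]

# An action that leaves the auxiliary values unchanged keeps its full primary reward.
safe = aup_reward(1.0, [0.5, 0.2], noop_values)

# An action that destroys the ability to optimize both auxiliary rewards is penalized.
destructive = aup_reward(1.0, [0.0, 0.0], noop_values, lam=2.0)

print(safe, destructive)  # safe is 1.0; destructive is roughly -0.4
```

In this sketch, both actions earn the same primary reward, but the second one scores lower overall because it changes what the agent could still achieve for the auxiliary reward functions. A larger `lam` makes the agent more conservative.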

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.09725/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/1902.09725/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/1902.09725/full.md

---
Source: https://tomesphere.com/paper/1902.09725