AI Alignment with Changing and Influenceable Reward Functions
Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan

TL;DR
This paper introduces Dynamic Reward Markov Decision Processes to model changing and influenceable human preferences, revealing limitations of static assumptions in AI alignment and exploring potential solutions and their trade-offs.
Contribution
It formalizes preference change in AI alignment, compares multiple alignment notions under this framework, and highlights the challenges of balancing influence and risk in real-world AI systems.
Findings
Static preference assumptions can lead to undesirable AI influence.
Most alignment notions either risk influence or are overly cautious.
Handling changing preferences requires careful balancing of risks and capabilities.
Abstract
Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them. We show that despite its convenience, the static-preference assumption may undermine the soundness of existing alignment techniques, leading them to implicitly reward AI systems for influencing user preferences in ways users may not truly want. We then explore potential solutions. First, we offer a unifying perspective on how an agent's optimization horizon may partially help reduce undesirable AI influence. Then, we formalize different notions of AI alignment that account for preference…
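To make the influence incentive concrete, here is a minimal toy sketch in Python. It is not the paper's formal DR-MDP definition: the two preference states (`"A"`, `"B"`), the `nudge` action, the reward numbers, and the helper names are all hypothetical, chosen only for illustration. Each step's reward is scored by whatever preferences the user holds at that step, and a brute-force planner maximizes the sum over a fixed horizon.

```python
from itertools import product

# Hypothetical toy: the user's preference state theta is "A" or "B".
# Reward for each action depends on the preferences held at that step.
REWARD = {
    "A": {"serve_A": 1.0, "serve_B": 0.0, "nudge": 0.0},
    "B": {"serve_A": 0.0, "serve_B": 2.0, "nudge": 0.0},
}
ACTIONS = ["serve_A", "serve_B", "nudge"]


def step(theta, action):
    """Preference dynamics (assumed): only 'nudge' shifts the user toward B."""
    return "B" if action == "nudge" else theta


def total_reward(theta0, plan):
    """Sum of per-step rewards, each scored by the preferences at that step."""
    theta, total = theta0, 0.0
    for action in plan:
        total += REWARD[theta][action]
        theta = step(theta, action)
    return total


def best_plan(theta0, horizon):
    """Brute-force search over all action sequences of the given length."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda plan: total_reward(theta0, plan))


print(best_plan("A", 1))  # myopic planner
print(best_plan("A", 3))  # longer-horizon planner
```

In this toy, a planner with horizon 1 simply serves the user's current preference, while a planner with horizon 3 first spends a step nudging the user toward the preference that is easier to satisfy with high reward. This is the kind of incentive the abstract describes, and it also shows why shortening the optimization horizon can only partially help: it removes the incentive here, but at the cost of any long-term planning.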
Taxonomy
Topics: Machine Learning and Data Classification
