Choice Between Partial Trajectories: Disentangling Goals from Beliefs
Henrik Marklund, Benjamin Van Roy

TL;DR
This paper introduces a choice model based on bootstrapped return that AI agents can use to interpret human choice data, enabling better disentanglement of goals from beliefs and more robust reward learning from human preferences.
Contribution
It proposes a bootstrapped return model for choice behavior, formalizes its properties with an Alignment Theorem, and demonstrates its advantages over previous models in disentangling goals from beliefs.
Findings
Unlike choices based on partial return, choices based on bootstrapped return reflect human beliefs about the environment.
Model is robust to choices based on partial return or cumulative advantage.
Formal proof via the Alignment Theorem supports the model's effectiveness.
Abstract
As AI agents generate increasingly sophisticated behaviors, manually encoding human preferences to guide these agents becomes more challenging. To address this, it has been suggested that agents instead learn preferences from human choice data. This approach requires a model of choice behavior that the agent can use to interpret the data. For choices between partial trajectories of states and actions, previous models assume choice probabilities are determined by the partial return or the cumulative advantage. We consider an alternative model based instead on the bootstrapped return, which adds to the partial return an estimate of the future return. Benefits of the bootstrapped return model stem from its treatment of human beliefs. Unlike partial return, choices based on bootstrapped return reflect human beliefs about the environment. Further, while recovering the reward function from…
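To make the distinction between partial return and bootstrapped return concrete, here is a minimal Python sketch. It is not the paper's implementation. The discount factor `gamma`, the logistic choice rule, the `temperature` parameter, and the numbers in the example are all assumptions made for illustration. The abstract only states that bootstrapped return adds an estimate of the future return to the partial return.

```python
import math


def partial_return(rewards, gamma=1.0):
    """Sum of (optionally discounted) rewards along a partial trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def bootstrapped_return(rewards, believed_future_value, gamma=1.0):
    """Partial return plus the human's estimate of the return that follows.

    `believed_future_value` is a hypothetical stand-in for the human's
    belief about the value of the state where the segment ends.
    """
    return partial_return(rewards, gamma) + gamma ** len(rewards) * believed_future_value


def choice_probability(score_a, score_b, temperature=1.0):
    """Probability of choosing A over B under an assumed logistic choice rule."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))


# Segment A collects reward now but ends in a state believed to be worthless.
# Segment B collects nothing now but ends in a state believed to be valuable.
rewards_a, future_a = [1.0, 1.0], 0.0
rewards_b, future_b = [0.0, 0.0], 10.0

p_partial = choice_probability(partial_return(rewards_a), partial_return(rewards_b))
p_bootstrapped = choice_probability(
    bootstrapped_return(rewards_a, future_a),
    bootstrapped_return(rewards_b, future_b),
)

print(f"P(choose A | partial return)      = {p_partial:.3f}")
print(f"P(choose A | bootstrapped return) = {p_bootstrapped:.3f}")
```

In this toy example the partial return model favors segment A, while the bootstrapped return model favors segment B, because only the latter accounts for what the human believes will happen after the segment ends.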
Taxonomy
Topics: Mathematics and Applications
