Learning Optimal Advantage from Preferences and Mistaking it for Reward
W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum

TL;DR
This paper explores how learning from human preferences in reinforcement learning often results in approximating advantage functions rather than true rewards, explaining why current methods perform well despite model mismatches.
Contribution
It demonstrates that preferences based on regret lead to learning an advantage function approximation, clarifies the implications for RLHF, and compares different optimization strategies.
Findings
Preferences based on regret approximate the advantage function, not reward.
If a specific pitfall is addressed, using the learned advantage as a reward is not particularly harmful: it acts as a highly shaped reward function.
Greedy maximization of the advantage is preferable to using the learned advantage as a reward.
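The last finding can be illustrated with a minimal sketch, assuming a small tabular setting. The table `a_hat`, its state and action names, and its values are hypothetical and are not taken from the paper. The sketch shows only that a learned advantage estimate already ranks the actions in each state, so a policy can be read off directly without a further reinforcement learning step.

```python
# Hypothetical learned advantage estimates, indexed as a_hat[state][action].
# Under an optimal advantage function, the best action in each state scores 0
# and every other action scores below 0.
a_hat = {
    "s0": {"left": -0.8, "right": 0.0},
    "s1": {"left": 0.0, "right": -1.5},
}

def greedy_policy(advantage_table):
    """Pick, in every state, the action with the highest estimated advantage."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in advantage_table.items()
    }

policy = greedy_policy(a_hat)
print(policy)
```

Treating `a_hat` as a reward would instead require running a separate policy-optimization step on top of it, which is the usage the findings describe as less desirable.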
Abstract
We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of the learned function is less…
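The two preference models contrasted in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes both models take the common logistic (Bradley-Terry style) form over summed per-transition scores, and that the regret-based model scores each transition by its optimal advantage. The function names and all numeric values are hypothetical.

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_prob(scores_1, scores_2):
    """P(segment 1 is preferred) under a logistic model over summed scores."""
    return logistic(sum(scores_1) - sum(scores_2))

# Hypothetical per-transition rewards: both segments accrue the same return.
rewards_1 = [-1.0, -1.0, -1.0]
rewards_2 = [-1.0, -1.0, -1.0]

# Hypothetical per-transition optimal advantages: segment 1 acts optimally,
# segment 2 takes two suboptimal actions.
adv_1 = [0.0, 0.0, 0.0]
adv_2 = [-2.0, 0.0, -2.0]

# Partial-return model: only the reward summed within each segment matters.
p_partial_return = preference_prob(rewards_1, rewards_2)

# Regret-style model (assumed form): summed optimal advantage matters.
p_regret = preference_prob(adv_1, adv_2)

print(p_partial_return, p_regret)
```

Because the two models share the same functional form in this sketch, fitting the partial-return model to preferences generated by regret places the learned per-transition function in the slot occupied by the optimal advantage, which is consistent with the abstract's argument.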
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
