The Limits of Preference Data for Post-Training
Eric Zhao, Jessica Dai, Pranjal Awasthi

TL;DR
This paper shows that preference data fundamentally limits outcome-based optimization in reinforcement learning, particularly the elicitation of reasoning behaviors, and argues that grounded human scoring and new algorithms are needed.
Contribution
The paper uses voting theory to formalize the limitations of preference data in outcome optimization, and explains why these limitations make RLHF less effective at eliciting reasoning behaviors.
Findings
Preference data can fundamentally limit outcome optimization even when the data is idealized (infinite, noiseless, and online).
Limitations mainly impact RLHF's ability to elicit robust reasoning strategies.
Grounded human scoring and new algorithms are necessary to overcome these limitations.
Abstract
Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or k-wise) that indicate, for given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent…
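The gap between ordinal and cardinal feedback can be illustrated with a classic voting-theory toy example. The sketch below is not the paper's construction; the annotators, outcomes, scores, and function names are all hypothetical and chosen only to show how a majority of pairwise preferences can select a different outcome than the one with the highest average score.

```python
from typing import Dict, List, Optional

# Hypothetical cardinal scores in [0, 1]: one dict per annotator,
# mapping each outcome to that annotator's score. Illustrative only.
Scores = List[Dict[str, float]]

SCORES: Scores = [
    {"A": 0.6, "B": 0.5},  # mildly prefers A
    {"A": 0.6, "B": 0.5},  # mildly prefers A
    {"A": 0.0, "B": 1.0},  # strongly prefers B
]


def mean_score(scores: Scores, outcome: str) -> float:
    """Average cardinal score of an outcome across annotators."""
    return sum(s[outcome] for s in scores) / len(scores)


def win_rate(scores: Scores, a: str, b: str) -> float:
    """Fraction of annotators who strictly prefer outcome a to outcome b."""
    wins = sum(1 for s in scores if s[a] > s[b])
    return wins / len(scores)


def score_winner(scores: Scores) -> str:
    """Outcome with the highest mean cardinal score."""
    outcomes = list(scores[0])
    return max(outcomes, key=lambda o: mean_score(scores, o))


def preference_winner(scores: Scores) -> Optional[str]:
    """Outcome that beats every other outcome by pairwise majority, if any."""
    outcomes = list(scores[0])
    for a in outcomes:
        if all(win_rate(scores, a, b) > 0.5 for b in outcomes if b != a):
            return a
    return None


print("preference winner:", preference_winner(SCORES))  # A
print("score winner:", score_winner(SCORES))            # B
```

In this toy setting, two of three annotators rank A above B, so pairwise preference data favors A, while B has the higher mean score. Ordinal comparisons discard the intensity of the third annotator's preference, which is the kind of information that cardinal scoring retains.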
Taxonomy
Topics: Scheduling and Timetabling Solutions · Human Resource Development and Performance Evaluation · Occupational and Professional Licensing Regulation
