Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning

Yifang Chen; Shuohang Wang; Ziyi Yang; Hiteshi Sharma; Nikos Karampatziakis; Donghan Yu; Kevin Jamieson; Simon Shaolei Du; Yelong Shen

arXiv:2407.02119 · cs.LG · July 10, 2024

TL;DR

This paper proposes a cost-effective method for constructing proxy reward models in reinforcement learning with human feedback, using on-policy queries and active learning to minimize expert labeling costs while improving model performance.

Contribution

It introduces a novel on-policy and active learning strategy for proxy reward model construction that significantly reduces expert query costs in RLHF.

Findings

1. Achieves over 1% improvement on multiple benchmarks with only 1.7K queries.

2. Effectively labels nine times more preference pairs with minimal labeled data.

3. Method is compatible with other expert query strategies for further cost reduction.

Abstract

Reinforcement learning with human feedback (RLHF), a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updating with given feedback oracles, which incurs significant expert query costs. We are the first to explore cost-effective proxy reward oracle construction strategies for further labeling preferences or rewards with extremely limited labeled data and…
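
The online setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: `build_preference_pairs`, `select_for_expert`, `toy_generate`, and `toy_reward` are hypothetical names, the generator and proxy reward are toy stand-ins for real models, and the selection rule shown is plain margin-based uncertainty sampling, a generic active learning heuristic that may differ from the strategy the authors use.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    proxy_reward: Callable[[str, str], float],
    n_samples: int = 4,
) -> List[Pair]:
    """Label self-generated responses with a proxy reward (assumed interface)."""
    pairs: List[Pair] = []
    for prompt in prompts:
        responses = generate(prompt, n_samples)  # on-policy samples
        ranked = sorted(responses, key=lambda r: proxy_reward(prompt, r))
        # Highest-scored response is "chosen", lowest-scored is "rejected".
        if len(ranked) >= 2 and ranked[0] != ranked[-1]:
            pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs


def select_for_expert(
    pairs: List[Pair],
    proxy_reward: Callable[[str, str], float],
    budget: int,
) -> List[Pair]:
    """Pick the `budget` pairs with the smallest proxy-reward margin.

    Margin-based uncertainty sampling is an assumption made for this
    sketch, not a rule taken from the paper.
    """
    def margin(pair: Pair) -> float:
        prompt, chosen, rejected = pair
        return proxy_reward(prompt, chosen) - proxy_reward(prompt, rejected)

    return sorted(pairs, key=margin)[:budget]


# Toy stand-ins (hypothetical): a real pipeline would call a policy model
# and a trained proxy reward model here.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt}:{'x' * i}" for i in range(1, n + 1)]


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


pairs = build_preference_pairs(["p1", "p2"], toy_generate, toy_reward, n_samples=3)
to_label = select_for_expert(pairs, toy_reward, budget=1)
```

In this sketch only the pairs in `to_label` would be sent to the expert oracle, while the remaining pairs keep their proxy labels; that split is what limits the expert query cost.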
