Semi-Supervised Reward Modeling via Iterative Self-Training

Yifei He; Haoxiang Wang; Ziyan Jiang; Alexandros Papangelis; Han Zhao

arXiv:2409.06903·cs.LG·September 12, 2024

TL;DR

This paper introduces SSRM, a semi-supervised method that improves reward modeling for language models by leveraging unlabeled data through iterative pseudo-labeling and confidence-based selection, reducing reliance on costly human annotations.

Contribution

The paper presents a semi-supervised reward modeling approach that improves reward model training by iteratively pseudo-labeling unlabeled data and keeping only high-confidence examples.

Findings

1. SSRM achieves performance comparable to reward models trained on fully labeled data.
2. SSRM reduces dependence on human-annotated preference data.
3. SSRM lowers the cost and time of training reward models.

Abstract

Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM…
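The three iterative steps named in the abstract (pseudo-labeling, confidence-based selection, supervised finetuning) can be sketched as a small self-training loop. This is a minimal illustration, not the authors' implementation: it assumes a toy linear reward model over feature vectors, a Bradley-Terry pairwise loss, confidence defined as max(p, 1 - p), and that selected examples are removed from the unlabeled pool. All function names, the threshold, and the learning rate are hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward(w, x):
    # Toy linear reward model (assumption): r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))


def train(w, pairs, lr=0.1, epochs=50):
    """Supervised finetuning on (chosen, rejected) pairs with a Bradley-Terry loss."""
    w = list(w)
    for _ in range(epochs):
        for chosen, rejected in pairs:
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            # Gradient step on -log p: move w along (chosen - rejected).
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w


def pseudo_label(w, unlabeled, threshold):
    """Label each unordered pair with the current model; keep confident ones."""
    selected, remaining = [], []
    for a, b in unlabeled:
        p = sigmoid(reward(w, a) - reward(w, b))  # P(a preferred over b)
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            # Store as (chosen, rejected) according to the predicted preference.
            selected.append((a, b) if p >= 0.5 else (b, a))
        else:
            remaining.append((a, b))
    return selected, remaining


def ssrm(labeled, unlabeled, dim, threshold=0.9, rounds=3):
    """Iterate: pseudo-label, select by confidence, finetune on the refined set."""
    w = train([0.0] * dim, labeled)
    pool = list(unlabeled)
    train_set = list(labeled)
    for _ in range(rounds):
        selected, pool = pseudo_label(w, pool, threshold)
        if not selected:
            break  # nothing passes the threshold; stop early
        train_set.extend(selected)
        w = train(w, train_set)
    return w, train_set


if __name__ == "__main__":
    labeled = [([1.0, 0.0], [0.0, 1.0]), ([2.0, 0.5], [0.5, 2.0])]
    unlabeled = [([0.0, 3.0], [3.0, 0.0]), ([1.0, 1.0], [1.0, 1.0])]
    weights, refined = ssrm(labeled, unlabeled, dim=2)
    print(len(refined), weights)
```

In this sketch the threshold trades pseudo-label quantity against noise: a higher value admits fewer but cleaner examples per round. A pair whose two responses score identically never passes the threshold and simply stays in the pool.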

Code & Models

Repositories

RLHFlow/RLHF-Reward-Modeling (PyTorch, official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: ALIGN