Semi-Supervised Reward Modeling via Iterative Self-Training
Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao

TL;DR
This paper introduces SSRM, a semi-supervised method that improves reward modeling for language models by leveraging unlabeled data through iterative pseudo-labeling and confidence-based selection, reducing reliance on costly human annotations.
Contribution
The paper presents a novel semi-supervised reward modeling approach that enhances reward model training using unlabeled data with iterative pseudo-labeling and confidence filtering.
Findings
SSRM achieves comparable performance to fully labeled models.
Reduces dependence on human-annotated data.
Significantly lowers training costs and time.
Abstract
Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsIntelligent Tutoring Systems and Adaptive Learning
MethodsALIGN
