Reward Modeling with Weak Supervision for Language Models

Ben Hauptvogel; Malte Ostendorff; Georg Rehm; Sebastian Möller

arXiv:2410.20869·cs.CL·October 29, 2024

TL;DR

This paper explores using weak supervision techniques to expand and improve reward models in reinforcement learning from human feedback for language models, especially benefiting smaller datasets.

Contribution

It introduces a weak supervision approach for reward modeling, leveraging heuristics and label calibration to reduce dependence on manual annotations in RLHF.

Findings

1. Weak supervision improves reward model performance on small datasets.
2. Effectiveness diminishes with larger, manually labeled datasets.
3. LLM-generated responses can be weakly labeled to extend preference data.

Abstract

Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained on response preferences determined by human labelers or AI systems, and is then used to refine the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision…
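To make the idea of labeling functions concrete, here is a minimal Python sketch of weakly labeling a pair of responses. It is an illustration under stated assumptions, not the authors' implementation: the two heuristics (`lf_length`, `lf_refusal`), the label encoding, and the thresholds are hypothetical, and a plain majority vote stands in for the calibrated label model described in the abstract.

```python
from typing import Callable, List

# Label encoding (an assumption for this sketch, not taken from the paper):
# FIRST  -> the first response is preferred
# SECOND -> the second response is preferred
# ABSTAIN -> the heuristic has no opinion on this pair
ABSTAIN, FIRST, SECOND = -1, 0, 1

LabelingFunction = Callable[[str, str], int]


def lf_length(first: str, second: str) -> int:
    """Hypothetical heuristic: prefer the clearly longer response."""
    n_first, n_second = len(first.split()), len(second.split())
    if abs(n_first - n_second) < 5:  # threshold chosen arbitrarily
        return ABSTAIN
    return FIRST if n_first > n_second else SECOND


def lf_refusal(first: str, second: str) -> int:
    """Hypothetical heuristic: disprefer a response that opens with a refusal."""
    markers = ("sorry", "i cannot", "i can't")
    first_refuses = first.lower().startswith(markers)
    second_refuses = second.lower().startswith(markers)
    if first_refuses == second_refuses:
        return ABSTAIN
    return SECOND if first_refuses else FIRST


LABELING_FUNCTIONS: List[LabelingFunction] = [lf_length, lf_refusal]


def weak_label(first: str, second: str, lfs: List[LabelingFunction]) -> int:
    """Combine labeling-function votes by majority; ties and all-abstain give ABSTAIN.

    A calibrated label model would instead weight each function by its
    estimated accuracy; majority vote is a simplified stand-in.
    """
    votes = [lf(first, second) for lf in lfs]
    n_first, n_second = votes.count(FIRST), votes.count(SECOND)
    if n_first == n_second:
        return ABSTAIN
    return FIRST if n_first > n_second else SECOND
```

Calling `weak_label(response_a, response_b, LABELING_FUNCTIONS)` on an unlabeled pair yields a noisy preference label, or `ABSTAIN` when the heuristics disagree or have nothing to say. Pairs that receive a label could then be added to the reward model's training data.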

Code & Models

Repositories

DFKI-NLP/weak-supervision-rlhf (Official)

Taxonomy

Topics: Topic Modeling

Methods: ALIGN