Reward Modeling with Weak Supervision for Language Models
Ben Hauptvogel, Malte Ostendorff, Georg Rehm, Sebastian Möller

TL;DR
This paper explores using weak supervision techniques to expand and improve reward models in reinforcement learning from human feedback for language models, especially benefiting smaller datasets.
Contribution
It introduces a weak supervision approach for reward modeling, leveraging heuristics and label calibration to reduce dependence on manual annotations in RLHF.
Findings
Weak supervision improves reward model performance on small datasets.
Effectiveness diminishes with larger, manually labeled datasets.
LLM-generated responses can be weakly labeled to extend preference data.
Abstract
Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained using response preferences determined by human labelers or AI systems, which then refines the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision…
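To make the idea of labeling functions over response pairs concrete, here is a minimal Python sketch. It is not the authors' implementation. The two heuristics (response length and word overlap with the prompt), the 5-word margin, and all function names are hypothetical choices for illustration. A plain majority vote stands in for the calibrated label model mentioned in the abstract.

```python
from typing import Callable, List

# Label convention (an assumption for this sketch):
# 1 = response A preferred, 0 = response B preferred, -1 = abstain.
PREFER_A, PREFER_B, ABSTAIN = 1, 0, -1

LabelingFunction = Callable[[str, str, str], int]


def lf_longer_response(prompt: str, a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the response with clearly more words."""
    len_a, len_b = len(a.split()), len(b.split())
    if abs(len_a - len_b) < 5:  # margin of 5 words is an arbitrary choice
        return ABSTAIN
    return PREFER_A if len_a > len_b else PREFER_B


def lf_prompt_overlap(prompt: str, a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the response sharing more words with the prompt."""
    prompt_words = set(prompt.lower().split())
    overlap_a = len(prompt_words & set(a.lower().split()))
    overlap_b = len(prompt_words & set(b.lower().split()))
    if overlap_a == overlap_b:
        return ABSTAIN
    return PREFER_A if overlap_a > overlap_b else PREFER_B


def weak_label(prompt: str, a: str, b: str, lfs: List[LabelingFunction]) -> int:
    """Combine labeling-function votes by majority; abstain on ties or no votes.

    A simple stand-in for a learned label model, which would instead
    weight each labeling function by its estimated accuracy.
    """
    votes = [lf(prompt, a, b) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    share_a = sum(votes) / len(votes)  # fraction of votes for response A
    if share_a == 0.5:
        return ABSTAIN
    return PREFER_A if share_a > 0.5 else PREFER_B


LFS = [lf_longer_response, lf_prompt_overlap]
```

Calling `weak_label(prompt, response_a, response_b, LFS)` on an unlabeled pair yields a noisy preference label, or an abstention when the heuristics give no clear signal. Pairs that receive a label could then be added to the reward model's training data.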
Taxonomy
Topics: Topic Modeling
Methods: ALIGN
