Learning Goal-Conditioned Representations for Language Reward Models
Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx

TL;DR
This paper introduces a goal-conditioned contrastive training method for reward models that enhances their performance, steerability, and ability to control language model outputs, leading to better alignment and cost savings.
Contribution
The authors propose a novel contrastive, goal-conditioned training approach for reward models that improves performance, steerability, and fine-grained control in language model alignment.
Findings
Up to 0.09 AUROC improvement on benchmarks like MATH and GSM8k.
2.3% increase in Helpful-Harmless dataset accuracy.
Filtering 55% of generated tokens improves cost efficiency.
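To make the AUROC metric in these findings concrete, the sketch below computes it from reward-model scores and binary correctness labels. It is a minimal illustration, not the authors' evaluation code: the function name `auroc` and the example scores are hypothetical, and the quadratic pairwise loop is chosen for readability rather than speed.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen positive example is scored
    higher than a randomly chosen negative one (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical reward scores for four solutions; label 1 = correct solution.
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
example_auroc = auroc(example_scores, example_labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a gain of 0.09 means the reward model ranks correct solutions above incorrect ones noticeably more often.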
Abstract
Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well: on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model…
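The contrastive objective described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact loss, so the sketch assumes cosine similarity between hidden-state vectors and a logistic (binary cross-entropy) form. The names `cosine`, `sigmoid`, and `contrastive_goal_loss` are hypothetical, and the vectors are toy values standing in for model representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def contrastive_goal_loss(state, preferred_future, dispreferred_future):
    """Assumed loss form: pull the current state toward a future state
    sampled from a preferred trajectory and push it away from a state
    sampled from a dispreferred trajectory."""
    pos = sigmoid(cosine(state, preferred_future))
    neg = sigmoid(cosine(state, dispreferred_future))
    return -math.log(pos) - math.log(1.0 - neg)


# Toy 2-D representations (hypothetical values).
state = [1.0, 0.0]
aligned = contrastive_goal_loss(state, [1.0, 0.0], [-1.0, 0.0])
misaligned = contrastive_goal_loss(state, [-1.0, 0.0], [1.0, 0.0])
```

In this toy setup the loss is lower when the state representation points toward the preferred future state than when it points toward the dispreferred one, which is the behavior the objective is meant to encourage.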
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: LLaMA
