Learning Goal-Conditioned Representations for Language Reward Models

Vaskar Nath; Dylan Slack; Jeff Da; Yuntao Ma; Hugh Zhang; Spencer Whitehead; Sean Hendryx

arXiv:2407.13887 · cs.CL · October 25, 2024

TL;DR

This paper introduces a goal-conditioned contrastive training method for reward models that enhances their performance, steerability, and ability to control language model outputs, leading to better alignment and cost savings.

Contribution

The authors propose a novel contrastive, goal-conditioned training approach for reward models that improves performance, steerability, and fine-grained control in language model alignment.

Findings

1. Up to 0.09 AUROC improvement in reward model performance on benchmarks such as MATH and GSM8k.

2. A 2.3% increase in accuracy on the Helpful-Harmless dataset.

3. Filtering up to 55% of generated tokens reduces generation cost.
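
To make the AUROC figures concrete, here is a minimal sketch of how AUROC can be computed for a reward model's scores. It assumes binary labels (for example, 1 for a correct solution and 0 for an incorrect one) and uses the pairwise-ranking definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `auroc` and the sample scores are hypothetical illustrations, not taken from the paper's evaluation code.

```python
def auroc(scores, labels):
    """Pairwise-ranking AUROC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical reward scores for four sampled solutions.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example)
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a 0.09 gain is a sizeable shift in how reliably scores separate the two classes.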

Abstract

Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well: on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model…
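
The contrastive objective described in the abstract can be illustrated with a small sketch. This is one plausible form under stated assumptions, not the authors' implementation: each trajectory is a list of per-token hidden-state vectors, similarity is cosine similarity, and the loss is a binary cross-entropy over a sigmoid of that similarity. The names `cosine`, `log_sigmoid`, and `contrastive_goal_loss` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def contrastive_goal_loss(preferred, dispreferred, rng):
    """Assumed loss form: pull a preferred state toward a future preferred
    (goal) state, and push a random dispreferred state away from that goal.
    `preferred` needs at least two states so that a future state exists."""
    t = rng.randrange(len(preferred) - 1)      # source state index
    g = rng.randrange(t + 1, len(preferred))   # future (goal) state index
    goal = preferred[g]
    negative = dispreferred[rng.randrange(len(dispreferred))]
    pos_sim = cosine(preferred[t], goal)
    neg_sim = cosine(negative, goal)
    # log(1 - sigmoid(x)) equals log_sigmoid(-x).
    return -log_sigmoid(pos_sim) - log_sigmoid(-neg_sim)


rng = random.Random(0)
# Toy 2-d hidden states: preferred states agree, the dispreferred one opposes.
loss = contrastive_goal_loss([[1.0, 0.0], [1.0, 0.0]], [[-1.0, 0.0]], rng)
print(loss)
```

The loss falls as preferred states align with their future goal state and as dispreferred states move away from it. In practice the vectors would come from the reward model's hidden states, and a term like this would typically be combined with the usual preference loss.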

Code & Models

Repositories

vaskarnathscale/goal-conditioned-rm (PyTorch, official)

Videos

Learning Goal-Conditioned Representations for Language Reward Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: LLaMA