A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift

Will LeVine; Benjamin Pikus; Anthony Chen; Sean Hendryx

arXiv:2311.14743 · cs.CL · January 25, 2024 · 1 citation

Open Access

TL;DR

This paper investigates how reward models used in aligning large language models perform under distribution shifts, revealing calibration issues and proposing an OOD detection method to identify shifts in prompts and responses.

Contribution

It provides a baseline, systematic analysis of reward model robustness under distribution shift and adapts an OOD detection technique to this setting.

Findings

01. Reward models show calibration issues under distribution shift.

02. Accuracy drops are larger under shifts in responses than under shifts in prompts.

03. An OOD detection method effectively identifies distribution shifts.

Abstract

Foundation models, specifically Large Language Models (LLMs), have recently gained widespread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align LLMs. These reward models are additionally used at inference time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance, measured via accuracy and calibration (i.e., alignment between accuracy and confidence), is affected by distribution shift. We show novel calibration patterns and accuracy drops due to OOD prompts and responses, and that the reward model is more sensitive to shifts in responses than in prompts. Additionally, we adapt an OOD detection technique commonly…
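To make the two metrics named in the abstract concrete, here is a minimal sketch of how accuracy and calibration could be measured for a pairwise reward model. It is an illustration under stated assumptions, not the paper's evaluation code: confidence is assumed to be the Bradley-Terry probability derived from the difference of two scalar rewards, calibration is summarized with expected calibration error (ECE), a common metric that the abstract does not name, and the function names `preference_confidence`, `expected_calibration_error`, and `evaluate` are hypothetical.

```python
import math


def preference_confidence(reward_a, reward_b):
    """Assumed Bradley-Terry probability that response A beats response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


def evaluate(pairs):
    """pairs: iterable of (reward_a, reward_b, human_prefers_a) tuples.

    Returns (accuracy, ece) for the reward model's pairwise choices.
    """
    confidences, correct = [], []
    for reward_a, reward_b, human_prefers_a in pairs:
        p_a = preference_confidence(reward_a, reward_b)
        model_prefers_a = p_a >= 0.5
        # Confidence in whichever response the model picked.
        confidences.append(max(p_a, 1.0 - p_a))
        correct.append(model_prefers_a == human_prefers_a)
    accuracy = sum(correct) / len(correct)
    return accuracy, expected_calibration_error(confidences, correct)
```

Running `evaluate` once on in-distribution pairs and once on pairs with shifted prompts or responses, then comparing the two `(accuracy, ece)` results, is one simple way to quantify the kind of degradation the abstract describes.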

Taxonomy

Topics: Topic Modeling · Software Engineering Research

Methods: ALIGN