A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal; Tanya Goyal; Jiacheng Xu; Greg Durrett

arXiv:2310.03716·cs.CL·July 12, 2024·2 cites

TL;DR

This paper shows that reinforcement learning from human feedback (RLHF) often improves model responses primarily by making them longer, and that reward models are a key source of this length bias.

Contribution

It uncovers the significant influence of response length in RLHF improvements and identifies reward models as the main source of length bias in training dynamics.

Findings

01. Length optimization is a major factor in RLHF success.

02. Purely length-based rewards can replicate RLHF improvements.

03. Reward models are non-robust and influenced by length biases.
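To make the second finding concrete, here is a minimal sketch of what a purely length-based reward could look like. It is an illustration only, not the paper's formulation: the function name `length_reward`, the `target_len` parameter, the whitespace tokenization, and the linear-then-saturating shape are all assumptions.

```python
def length_reward(response: str, target_len: int = 256) -> float:
    """Hypothetical length-only reward (not the paper's exact definition).

    The score grows linearly with the number of whitespace-separated tokens
    and saturates at 1.0 once the response reaches target_len tokens.
    The content of the response is ignored entirely.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    n_tokens = len(response.split())  # crude whitespace tokenization (assumption)
    return min(n_tokens, target_len) / target_len


if __name__ == "__main__":
    short = "Yes."
    long = "Yes, and here is a much longer explanation of why that is the case."
    # The longer response scores higher regardless of quality.
    print(length_reward(short, target_len=16), length_reward(long, target_len=16))
```

A reward like this could stand in for a learned reward model in an RL loop; the point of the sketch is only that it scores outputs without looking at what they say.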

Abstract

Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we…
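One simple way to probe the relationship the abstract describes is to measure how strongly reward scores track response length. The sketch below computes a Pearson correlation between whitespace token counts and reward values. The helper names `pearson` and `length_reward_correlation`, the tokenization, and the choice of Pearson correlation are assumptions for illustration, not the paper's analysis code.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses: Sequence[str],
                              rewards: Sequence[float]) -> float:
    """Correlate response length (whitespace tokens, an assumption) with reward."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, rewards)


if __name__ == "__main__":
    # Toy data: rewards here are made up purely to exercise the function.
    responses = ["ok", "ok then", "ok then here is more"]
    rewards = [0.1, 0.2, 0.9]
    print(length_reward_correlation(responses, rewards))
```

A value near 1 on real reward-model scores would suggest that length alone explains much of the reward; it would not by itself show that length causes the score.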

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper clearly documents the length-increasing issue in the RLHF pipeline. They conduct experiments on several datasets across different domains to demonstrate the issue.

Weaknesses

The paper is descriptive rather than prescriptive. Specifically, the authors describe the correlation between increasing length and standard PPO training without explaining the underlying cause of the phenomenon. Although they propose several heuristic remedies, the problem is not fully resolved. Therefore, the paper might not directly contribute to improving existing RLHF methods.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper offers a well-executed investigation of a well-known pattern: the correlation between RLHF scores and length. The findings are persuasive and support the conclusions.

Weaknesses

The paper's comprehensibility is somewhat challenging. The conclusions drawn from the tables and figures lack clarity, and it's not easy to discern the key takeaway from the experiments presented. The paper would be improved with some rewriting and clarification. The experiments themselves are well-executed, and they do support the main message. However, it's worth noting that this pattern has been observed in numerous other studies, and strategies to address this bias/reward hacking have been

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

This paper focuses on a very important problem: what role does length play in RLHF? The paper conducts extensive experiments to demonstrate the correlation between length and reward model scores and explores several ways to mitigate length bias. The results can provide constructive guidance for future research.

Weaknesses

1. The most confusing part for me is the evaluation. What is the rationale for adopting a length-bias metric, GPT-4 evaluation [1], when evaluating the correlation between length and RLHF performance? Two potential factors can be affected by length, making it hard to disentangle the attribution of length correlation. Or do you aim to reveal the bias issue of GPT-4 in this paper? Then could you please explain more clearly what you mean by "length correlations in RLHF" and what the length correlat

Code & Models

Repositories

prasanns/rlhf-length-biases
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Expert finding and Q&A systems

Methods: ALIGN