A Long Way to Go: Investigating Length Correlations in RLHF
Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

TL;DR
This paper shows that reinforcement learning from human feedback (RLHF) often improves model responses primarily by making them longer, and that reward models are a key source of this length bias.
Contribution
It uncovers the significant influence of response length in RLHF improvements and identifies reward models as the main source of length bias in training dynamics.
Findings
Length optimization is a major factor in RLHF success.
Purely length-based rewards can replicate RLHF improvements.
Reward models are non-robust and influenced by length biases.
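The second finding can be made concrete with a minimal sketch of a length-only reward. The functional form below (a reward that peaks when the response reaches a target length), the whitespace tokenizer, and the names `length_reward` and `target_len` are assumptions for illustration; this summary does not state the exact reward the authors used.

```python
def length_reward(response: str, target_len: int) -> float:
    """Hypothetical length-only reward: highest when the response has
    exactly `target_len` tokens, decreasing linearly on either side.

    Tokens are approximated by whitespace splitting (an assumption;
    a real setup would likely use the policy model's tokenizer).
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    n_tokens = len(response.split())
    # 1.0 at the target length, 0.0 at zero tokens or twice the target,
    # negative beyond that.
    return 1.0 - abs(n_tokens / target_len - 1.0)
```

A function like this could stand in for a learned reward model in an RL loop, which is one way to test how much of an observed improvement length alone accounts for.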
Abstract
Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we…
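One simple way to probe the length-reward relationship described in the abstract is to compute the Pearson correlation between response length and reward model score. The sketch below is an illustration under stated assumptions, not the authors' analysis: the helper names `pearson` and `length_reward_correlation` are hypothetical, lengths are counted by whitespace splitting, and the reward scores are assumed to come from whatever reward model is under study.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses: Sequence[str],
                              rewards: Sequence[float]) -> float:
    """Correlate whitespace-token length with reward scores.

    `rewards` holds one score per response, produced by the reward
    model being examined (hypothetical input; not computed here).
    """
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, rewards)
```

A value near 1 would indicate that the reward model's scores rise almost linearly with length on the sampled responses; a value near 0 would indicate no linear relationship.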
Peer Reviews
Decision: Submitted to ICLR 2024
The paper clearly documents the length-increasing issue in the RLHF pipeline. They conduct experiments on several datasets across different domains to demonstrate the issue.
The paper is descriptive rather than prescriptive. Specifically, the authors describe the correlation between increasing length and standard PPO training without providing the underlying reason for the phenomenon. Although they propose several heuristic-inspired remedies, the problem is not fully resolved. Therefore, it might not directly contribute to improving existing RLHF methods.
The paper offers a well-executed investigation of a well-known pattern: the correlation between RLHF scores and length. The findings are persuasive and support the conclusions.
The paper's comprehensibility is somewhat challenging. The conclusions drawn from the tables and figures lack clarity, and it's not easy to discern the key takeaway from the experiments presented. The paper would be improved with some rewriting and clarification. The experiments themselves are well-executed, and they do support the main message. However, it's worth noting that this pattern has been observed in numerous other studies, and strategies to address this bias/reward hacking have been
This paper focuses on a very important problem: what role does length play in RLHF? The paper conducts extensive experiments to demonstrate the correlation between length and reward model scores and explores several ways to mitigate length bias. The results can provide constructive guidance for future research.
1. The most confusing part for me is the evaluation. What is the rationale for adopting a length-bias metric, GPT-4 evaluation [1], when evaluating the correlation between length and RLHF performance? Two potential factors can be affected by length, making it hard to disentangle the attribution of length correlation. Or do you aim to reveal the bias issue of GPT-4 in this paper? Then could you please explain more clearly what you mean by "length correlations in RLHF" and what the length correlat
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Expert finding and Q&A systems
Methods: ALIGN
