Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
Jiayi Zhou, Jiaming Ji, Juntao Dai, Dong Li, Yaodong Yang

TL;DR
This paper introduces a sequence-to-sequence reward modeling approach for RLHF that leverages language feedback instead of scalar feedback, leading to improved alignment of large language models with human preferences across multiple tasks.
Contribution
The paper proposes a novel seq2seq reward modeling method that enhances RLHF by using language feedback, eliminating the need for additional annotations or training stages.
Findings
Reduces the refusal-to-response paradigm in safety dialogues
Mitigates long-response bias in summarization
Achieves 76.9% win rate on NLP tasks
Abstract
Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately aligns with human preferences, causing LLMs to explore unexpected generalizations and fail to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replace the reward modeling target, moving from binary maximum likelihood estimation (MLE) to sequence MLE. This…
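The contrast between the two training targets named in the abstract can be illustrated with a minimal sketch. It assumes the common Bradley-Terry pairwise form for the binary MLE objective and a plain token-level negative log-likelihood for the sequence MLE objective. The function names and toy inputs are hypothetical, and the abstract as shown does not specify how the paper constructs its target sequences, so this is not the authors' implementation.

```python
import math


def binary_mle_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected).

    Assumption: the RM emits one scalar reward per response.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sequence_mle_loss(token_logits, target_ids) -> float:
    """Mean token-level negative log-likelihood of a target sequence.

    token_logits: one list of vocabulary logits per target position.
    target_ids:   the target token index at each position.
    Assumption: the RM is trained like a language model on target tokens.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        peak = max(logits)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


if __name__ == "__main__":
    # Scalar feedback: a single number summarizes the whole comparison.
    print(binary_mle_loss(r_chosen=1.5, r_rejected=0.2))
    # Language feedback: every target token contributes its own loss term.
    toy_logits = [[2.0, 0.1, -1.0], [0.3, 1.7, 0.0]]
    print(sequence_mle_loss(toy_logits, target_ids=[0, 1]))
```

The binary objective yields one training signal per preference pair, while the sequence objective yields one per token, which is one way to read the abstract's claim that language feedback is richer than scalar feedback.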
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
