Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Jiayi Zhou; Jiaming Ji; Juntao Dai; Dong Li; Yaodong Yang

arXiv:2409.00162 · cs.CL · December 25, 2025

TL;DR

This paper introduces a sequence-to-sequence reward modeling approach for RLHF that leverages language feedback instead of scalar feedback, leading to improved alignment of large language models with human preferences across multiple tasks.

Contribution

The paper proposes a novel seq2seq reward modeling method that enhances RLHF by using language feedback, eliminating the need for additional annotations or training stages.

Findings

01 Reduces the refusal-to-response pattern in safety dialogues

02 Mitigates long-response bias in summarization

03 Achieves a 76.9% win rate on NLP tasks

Abstract

Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately aligns with human preferences, causing LLMs to explore unexpected generalizations and fail to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We change the reward modeling target from binary maximum likelihood estimation (MLE) to sequence MLE. This…
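To make the contrast between the two training targets concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract excerpt does not give the exact losses, so the sketch assumes the standard Bradley-Terry form for the binary target and a plain mean per-token negative log-likelihood for the sequence target. The function names `binary_mle_loss` and `sequence_mle_loss`, and the toy inputs, are hypothetical.

```python
import math
from typing import Sequence


def binary_mle_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected).

    The reward model emits one scalar per response, and the loss only
    says which of the two responses should score higher.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sequence_mle_loss(
    token_probs: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Assumed sequence objective: mean negative log-likelihood per token.

    token_probs[t] is the model's distribution over the vocabulary at
    step t; target_ids[t] is the token of the target sequence at step t.
    Every token contributes its own term to the loss.
    """
    nll = 0.0
    for probs, token in zip(token_probs, target_ids):
        nll -= math.log(probs[token])
    return nll / len(target_ids)


# Toy usage: one scalar comparison versus a two-token target sequence.
pair_loss = binary_mle_loss(r_chosen=1.2, r_rejected=0.3)
seq_loss = sequence_mle_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Under these assumptions, the binary target yields one training signal per preference pair, while the sequence target yields one per token, which is one way to read the abstract's distinction between scalar and language feedback.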

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence