Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, and Mingyuan Zhou

TL;DR
This paper introduces a segment-level reward model for RLHF in language models. It assigns a reward to each semantically complete text segment, rather than to a whole sequence or a single token, with the aim of better aligning the model with human preferences.
Contribution
It proposes a novel segment-level reward learning approach that allows dynamic segmentation and dense reward interpolation, enhancing RLHF effectiveness.
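To make the two ideas concrete, here is a minimal sketch of dynamic segmentation followed by dense reward interpolation. It is an illustration under stated assumptions, not the authors' implementation: the boundary rule (start a new segment whenever a per-token score exceeds a threshold), the even-split interpolation, and the names `segment_by_threshold` and `interpolate_rewards` are all hypothetical.

```python
from typing import List, Tuple


def segment_by_threshold(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Split token positions into contiguous segments.

    Assumption for illustration: a new segment starts at any token whose
    score exceeds `threshold`. Returns half-open (start, end) index pairs
    that together cover every token exactly once.
    """
    if not scores:
        return []
    # Token 0 always opens the first segment.
    starts = [0] + [i for i in range(1, len(scores)) if scores[i] > threshold]
    ends = starts[1:] + [len(scores)]
    return list(zip(starts, ends))


def interpolate_rewards(
    segments: List[Tuple[int, int]],
    segment_rewards: List[float],
    mode: str = "even",
) -> List[float]:
    """Turn one reward per segment into one reward per token.

    mode="even": split each segment reward equally over its tokens, so the
                 per-segment total is preserved (assumed interpolation scheme).
    mode="last": place the whole segment reward on the segment's final token.
    """
    n_tokens = segments[-1][1] if segments else 0
    dense = [0.0] * n_tokens
    for (start, end), reward in zip(segments, segment_rewards):
        if mode == "even":
            for t in range(start, end):
                dense[t] = reward / (end - start)
        elif mode == "last":
            dense[end - 1] = reward
        else:
            raise ValueError(f"unknown mode: {mode}")
    return dense


# Hypothetical per-token scores for a 7-token response.
token_scores = [0.1, 0.2, 2.5, 0.3, 0.1, 3.0, 0.2]
segments = segment_by_threshold(token_scores, threshold=2.0)
dense_rewards = interpolate_rewards(segments, [1.0, -0.6, 0.4], mode="even")
```

With the even split, every token receives a learning signal while each segment's total reward is unchanged; the "last" mode keeps the signal sparse within a segment and is included only for comparison.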
Findings
Achieves competitive performance on three RLHF benchmarks.
Demonstrates the effectiveness of segment-based rewards through ablation studies.
Provides a scalable method compatible with standard preference datasets.
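One way to see why segment rewards can be trained on ordinary sequence-level preference pairs: aggregate the segment rewards of each response into a single score, then apply a standard Bradley-Terry preference loss to the chosen and rejected scores. The sketch below assumes mean aggregation and uses hypothetical function names; the paper's actual aggregation and parameterization may differ.

```python
import math
from typing import List


def sequence_score(segment_rewards: List[float]) -> float:
    """Aggregate segment rewards into one sequence-level score.

    The mean is an assumption made for this sketch.
    """
    return sum(segment_rewards) / len(segment_rewards)


def bradley_terry_loss(chosen: List[float], rejected: List[float]) -> float:
    """Preference loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = sequence_score(chosen) - sequence_score(rejected)
    # -log sigmoid(m) = log(1 + exp(-m)), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical segment rewards for a preferred and a dispreferred response.
loss = bradley_terry_loss(chosen=[1.0, 0.5], rejected=[0.0, -0.5])
```

Because the loss only sees one score per response, the dataset needs no segment-level labels; the gradient flows back to each segment's reward through the aggregation.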
Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into…
Code & Models
- 🤗 yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k · model · 2 downloads
- 🤗 yyqoni/Phi-3-mini-4k-instruct-token-rm-700k · model · 5 downloads
- 🤗 yyqoni/Phi-3-mini-4k-instruct-bandit-rm-700k · model · 5 downloads
- 🤗 yyqoni/rlhflow-llama-3-sft-8b-v2-segment-rm-700k · model
- 🤗 yyqoni/rlhflow-llama-3-sft-8b-v2-token-rm-700k · model · 4 downloads
- 🤗 yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-rm-700k · model · 1 download
- 🤗 yyqoni/meta-llama-3.1-instruct-8b-token-rm-700k · model · 1 download
- 🤗 yyqoni/meta-llama-3.1-instruct-8b-bandit-rm-700k · model · 4 downloads
- 🤗 yyqoni/meta-llama-3.1-instruct-8b-segment-rm-700k · model
- 🤗 yyqoni/rlhflow-llama-3-sft-8b-v2-segment-ppo-60k · model
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN
