The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall

TL;DR
This paper reproduces and analyzes the RLHF implementation for TL;DR summarization, revealing how response quality improves with model size and providing detailed insights and publicly available resources.
Contribution
It provides the first open reproduction of RLHF scaling behaviors for TL;DR summarization, detailing implementation and sharing trained models and code.
Findings
Response quality of the RLHF-trained Pythia models scales with model size
The 2.8B and 6.9B models outperform OpenAI's released 1.3B checkpoint
Trained model checkpoints and code are publicly released to support further research
Abstract
This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field (https://github.com/vwxyzjn/summarize_from_feedback_details).
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems · Data Quality and Management
Methods: Pythia
