The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization

Shengyi Huang; Michael Noukhovitch; Arian Hosseini; Kashif Rasul; Weixun Wang; Lewis Tunstall

arXiv:2403.17031 · cs.LG · March 27, 2024 · 2 cites

TL;DR

This paper reproduces and analyzes the RLHF pipeline for TL;DR summarization, shows that response quality improves with model size, and releases its implementation details, trained checkpoints, and code.

Contribution

It provides the first open reproduction of RLHF scaling behaviors for TL;DR summarization, detailing implementation and sharing trained models and code.

Findings

01

Response quality of RLHF-trained Pythia models scales with model size

02

The 2.8B and 6.9B models outperform OpenAI's released 1.3B checkpoint

03

Public release of models and code to support further research

Abstract

This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights from the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field (https://github.com/vwxyzjn/summarize_from_feedback_details).

Code & Models

Repositories

vwxyzjn/summarize_from_feedback_details
PyTorch · Official

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems · Data Quality and Management

Methods: Pythia