Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch; Shengyi Huang; Sophie Xhonneux; Arian Hosseini; Rishabh Agarwal; Aaron Courville

arXiv:2410.18252 · cs.LG · April 29, 2025

TL;DR

This paper introduces asynchronous RLHF, a method that separates generation and learning to enable faster, off-policy training of language models, achieving significant speedups while maintaining performance.

Contribution

It proposes a novel asynchronous RLHF framework that improves training efficiency and scalability for language models by decoupling sample generation from learning.

Findings

01 Asynchronous RLHF speeds up training by ~40% for chatbots.

02 Robustness to off-policy data increases with model size.

03 Achieves ~70% faster fine-tuning on GSM8k with maintained accuracy.

Abstract

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model which give a worse training signal. We tackle the fundamental challenge in this regime: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several…
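The separation of generation and learning described in the abstract can be illustrated with a toy producer-consumer loop. This is a minimal sketch, not the authors' implementation: there is no model, reward model, or gradient step here, and every name (`run_async`, `policy_version`, `max_queued`) is hypothetical. An integer version counter stands in for the policy weights, and the recorded "staleness" is the number of updates between the policy that generated a batch and the policy that trains on it.

```python
import queue
import threading


def run_async(num_steps, max_queued=1):
    """Toy asynchronous loop: one thread generates batches while another trains.

    Returns the staleness (in policy updates) of each batch when it was trained on.
    """
    policy_version = 0                          # stand-in for the learner's current weights
    lock = threading.Lock()
    batches = queue.Queue(maxsize=max_queued)   # bounds how far generation can run ahead
    staleness = []

    def generate():
        for step in range(num_steps):
            with lock:
                version = policy_version        # snapshot of the policy used for sampling
            # Placeholder for sampling completions and labelling them with a reward model.
            batch = {"step": step, "version": version}
            batches.put(batch)                  # blocks while the queue is full

    def learn():
        nonlocal policy_version
        for _ in range(num_steps):
            batch = batches.get()               # train on a possibly stale (off-policy) batch
            with lock:
                staleness.append(policy_version - batch["version"])
                policy_version += 1             # placeholder for one gradient update

    generator = threading.Thread(target=generate)
    learner = threading.Thread(target=learn)
    generator.start()
    learner.start()
    generator.join()
    learner.join()
    return staleness


if __name__ == "__main__":
    print(run_async(num_steps=10))
```

In this sketch the bounded queue is what limits off-policyness: with `max_queued=1`, a batch is at most two updates stale when it is trained on. A staleness of 0 throughout would correspond to the synchronous, on-policy setting. The exact values printed depend on thread scheduling.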

Code & Models

Repositories

mnoukhov/async_rlhf (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · LLaMA · Direct Preference Optimization