Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville

TL;DR
This paper introduces asynchronous RLHF, a method that separates generation and learning to enable faster, off-policy training of language models, achieving significant speedups while maintaining performance.
Contribution
It proposes a novel asynchronous RLHF framework that improves training efficiency and scalability for language models by decoupling sample generation from learning.
Findings
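Asynchronous RLHF trains a general-purpose chatbot from LLaMA 3.1 8B on an instruction-following task ~40% faster than a synchronous run while matching final performance.
Among the RLHF algorithms tested, online DPO is the most robust to off-policy data, and robustness increases with the size of the policy model.
Asynchronous RL fine-tunes Rho 1B on GSM8k ~70% faster while matching synchronous accuracy.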
Abstract
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model which give a worse training signal. We tackle the fundamental challenge in this regime: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several…
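The difference between the synchronous and asynchronous schedules described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: `generate`, `train_step`, `Batch`, and `wall_clock` are hypothetical stand-ins, the policy is reduced to a version counter, and the timing model assumes generation and training run on separate hardware with no weight-transfer cost.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    policy_version: int  # version of the policy that generated these samples


def generate(policy_version: int) -> Batch:
    """Hypothetical stand-in for LLM sampling plus reward-model labelling."""
    return Batch(policy_version=policy_version)


def train_step(policy_version: int, batch: Batch) -> int:
    """Hypothetical stand-in for one gradient update; returns the new version."""
    return policy_version + 1


def run_synchronous(num_steps: int) -> list:
    """Generate, then train, strictly in turn: every batch is on-policy."""
    version, staleness = 0, []
    for _ in range(num_steps):
        batch = generate(version)
        staleness.append(version - batch.policy_version)
        version = train_step(version, batch)
    return staleness


def run_asynchronous(num_steps: int) -> list:
    """Train on the previous batch while the next batch is being generated."""
    version, staleness = 0, []
    pending = generate(version)  # warm-up batch from the initial policy
    for _ in range(num_steps):
        # Conceptually concurrent: the generator samples from the current
        # weights while the learner updates them on the older batch.
        next_batch = generate(version)
        staleness.append(version - pending.policy_version)
        version = train_step(version, pending)
        pending = next_batch
    return staleness


def wall_clock(num_steps: int, gen_time: float, train_time: float):
    """Idealized timing (assumption): async overlaps generation and training."""
    sync = num_steps * (gen_time + train_time)
    overlapped = gen_time + num_steps * max(gen_time, train_time)
    return sync, overlapped


print(run_synchronous(4))
print(run_asynchronous(4))
print(wall_clock(10, 1.0, 1.0))
```

In this sketch the synchronous loop always trains on samples from the current policy, while the asynchronous loop trains on samples that are one update old after the first step. That one-step lag is the "online but off-policy" regime the abstract refers to; the achievable speedup in practice depends on how well generation and training time are balanced.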
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: LLaMA · Direct Preference Optimization
