RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee; Samrat Phatale; Hassan Mansoor; Thomas Mesnard; Johan Ferret; Kellie Lu; Colton Bishop; Ethan Hall; Victor Carbune; Abhinav Rastogi; Sushant Prakash

arXiv:2309.00267·cs.CL·September 4, 2024·67 cites

Open Access

TL;DR

This paper compares RLAIF, which trains on AI-generated preference labels, with RLHF. It shows that RLAIF performs comparably and points toward scalable, self-improving language model alignment without expensive human labels.

Contribution

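The paper evaluates RLAIF (introduced in Bai et al.), which replaces human feedback with AI-generated preferences, and introduces d-RLAIF, a variant that obtains rewards directly from an LLM instead of training a reward model. It demonstrates the effectiveness and scalability of both for aligning large language models.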

Findings

01

RLAIF achieves similar performance to RLHF across multiple tasks.

02

RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or the exact same checkpoint as the initial policy.

03

d-RLAIF surpasses canonical RLAIF by directly obtaining rewards from LLMs.

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an…
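The reward-model step described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that the AI labeler yields two scores (one per candidate response), that these are turned into a soft preference label with a softmax, and that the RM is trained with a cross-entropy loss against a Bradley-Terry style probability. None of these details are stated in the text above, and the function names are illustrative only, not the authors' implementation.

```python
import math


def soft_preference(score_first, score_second):
    """Softmax over two labeler scores (assumed inputs).

    Returns (P(first preferred), P(second preferred)).
    """
    m = max(score_first, score_second)  # subtract the max for numerical stability
    e1 = math.exp(score_first - m)
    e2 = math.exp(score_second - m)
    z = e1 + e2
    return e1 / z, e2 / z


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_first, r_second, p_first):
    """Cross-entropy between the AI soft label and the RM's implied preference.

    Assumption: the RM's preference probability is sigmoid(r_first - r_second)
    (a Bradley-Terry style model), where r_* are scalar RM rewards.
    """
    diff = r_first - r_second
    return -(p_first * log_sigmoid(diff) + (1.0 - p_first) * log_sigmoid(-diff))


if __name__ == "__main__":
    # Hypothetical labeler scores favouring the first response.
    p_first, p_second = soft_preference(2.0, 0.0)
    # An RM that agrees with the labeler incurs a lower loss than one that disagrees.
    print(reward_model_loss(1.5, 0.0, p_first))
    print(reward_model_loss(0.0, 1.5, p_first))
```

Using soft labels rather than hard 0/1 labels is one possible design choice here: it lets the loss reflect how confident the labeler was in each comparison.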

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques