A Critical Evaluation of AI Feedback for Aligning Large Language Models
Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar

TL;DR
This paper critically examines reinforcement learning with AI feedback (RLAIF) for large language models, revealing that simpler supervised fine-tuning with stronger teachers can outperform complex RL methods.
Contribution
It demonstrates that supervised fine-tuning with GPT-4 can surpass RLAIF, and provides insights into when RLAIF is beneficial versus when simpler methods suffice.
Findings
Supervised fine-tuning with GPT-4 outperforms RLAIF pipelines.
The benefits of RLAIF depend on model families and evaluation protocols.
RLAIF's improvements are largely due to the quality of the teacher model.
Abstract
Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g. GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher…
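The two data-collection stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`build_sft_data`, `build_preference_data`) and the toy teacher, sampler, and critic are all hypothetical stand-ins. In practice the teacher and critic would be calls to models such as GPT-3.5 or GPT-4, and the actual SFT and RL training steps are omitted here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases for this sketch.
Teacher = Callable[[str], str]                # prompt -> demonstration
SamplePair = Callable[[str], Tuple[str, str]]  # prompt -> two candidate responses
Critic = Callable[[str, str, str], int]       # (prompt, a, b) -> index of preferred response


def build_sft_data(prompts: List[str], teacher: Teacher) -> List[Dict[str, str]]:
    """Stage 1: collect teacher demonstrations for supervised fine-tuning."""
    return [{"prompt": p, "response": teacher(p)} for p in prompts]


def build_preference_data(
    prompts: List[str], sample_pair: SamplePair, critic: Critic
) -> List[Dict[str, str]]:
    """Stage 2: have a critic rank two candidate responses per prompt."""
    data = []
    for p in prompts:
        a, b = sample_pair(p)
        preferred = critic(p, a, b)  # 0 means a is preferred, 1 means b
        chosen, rejected = (a, b) if preferred == 0 else (b, a)
        data.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return data


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_teacher(prompt: str) -> str:
    return f"[teacher answer to: {prompt}]"


def toy_sample_pair(prompt: str) -> Tuple[str, str]:
    return (f"Short reply to {prompt}", f"A longer, more detailed reply to {prompt}")


def toy_critic(prompt: str, a: str, b: str) -> int:
    # Prefers the longer response; a real critic would be a strong LLM judge.
    return 0 if len(a) >= len(b) else 1


if __name__ == "__main__":
    prompts = ["Explain RL.", "What is SFT?"]
    print(build_sft_data(prompts, toy_teacher)[0])
    print(build_preference_data(prompts, toy_sample_pair, toy_critic)[0])
```

In this framing, the abstract's claim concerns the strength of the model behind `teacher` relative to the model behind `critic`: when the teacher is the weaker of the two, gains attributed to the second stage may largely reflect that gap.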
Peer Reviews
Decision · NeurIPS 2024 poster
- The paper is well-written and easy to follow.
- The authors are addressing a significant problem.
- The experiments were really well designed and performed.
- The authors show the robustness of their observation by running experiments across several models and dataset splits.
- The authors not only identified the problem in LAIF but also provided some possible explanations that enhanced the reader's understanding of it.
- LAIF is an important path forward for improving LLM instruction followin
- Some of the observations in the paper are straightforward.
- A few more experiments should be included in the paper to complete some of its conclusions.
- The 10% rule doesn't always hold true. In Figure 3 and Figure 4, SFT 10% performs worse than SFT 100%. A better split could improve SFT performance when doing SFT + LAIF, which is important based on the paper's conclusions.
- The LAIF ablation experiments for addressing the LAIF ineffectiveness, either as preference data or as the base model
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Reinforcement Learning from AI Feedback · Label Smoothing · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Transformer
