A Critical Evaluation of AI Feedback for Aligning Large Language Models

Archit Sharma; Sedrick Keh; Eric Mitchell; Chelsea Finn; Kushal Arora; Thomas Kollar

arXiv:2402.12366·cs.LG·February 20, 2024·1 cites

TL;DR

This paper critically examines reinforcement learning with AI feedback (RLAIF) for large language models, revealing that simpler supervised fine-tuning with stronger teachers can outperform complex RL methods.

Contribution

It demonstrates that supervised fine-tuning with GPT-4 as the teacher can surpass RLAIF, and provides insights into when RLAIF is beneficial and when simpler methods suffice.

Findings

01 Supervised fine-tuning with GPT-4 as the teacher outperforms RLAIF pipelines.

02 The benefits of RLAIF depend on model families and evaluation protocols.

03 RLAIF's improvements stem largely from using a weaker teacher for SFT than the critic used for AI feedback.

Abstract

Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g., GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher…
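The abstract does not say which algorithm implements the RL step. The linked repository is named dpo-rlaif, so the sketch below assumes a DPO-style preference loss, where the critic's feedback is a choice between two responses. The function names, argument names, and the default `beta` are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not from the paper
) -> float:
    """DPO-style loss for one preference pair labeled by a critic.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen SFT reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy shifts probability toward the chosen response.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference, both margins are zero and the loss is log 2; it drops below that value only once the policy favors the critic's chosen response more than the reference does.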

Peer Reviews

Decision·NeurIPS 2024 poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is well-written and easy to follow.
- The authors are addressing a significant problem.
- The experiments were really well designed and performed.
- The authors show the robustness of their observation by running experiments across several models and dataset splits.
- The authors not only identified the problem in LAIF but also provided some possible explanations that enhanced the reader's understanding of it.
- LAIF is an important path forward for improving LLM instruction followin…

Weaknesses

- Some of the observations in the paper are straightforward.
- A few more experiments should be included in the paper to complete some of its conclusions.
- The 10% rule doesn't always hold true. In Figure 3 and Figure 4, SFT 10% performs worse than SFT 100%. A better split could improve SFT performance when doing SFT + LAIF, which is important based on the paper's conclusions.
- The LAIF ablation experiments for addressing the LAIF ineffectiveness, either as preference data or as the base model…

Code & Models

Repositories

architsharma97/dpo-rlaif
PyTorch · Official

Datasets

argilla/OpenHermesPreferences
dataset · 704 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Reinforcement Learning from AI Feedback · Label Smoothing · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Transformer