The False Promise of Imitating Proprietary LLMs

Arnav Gudibande; Eric Wallace; Charlie Snell; Xinyang Geng; Hao Liu; Pieter Abbeel; Sergey Levine; Dawn Song

arXiv:2305.15717·cs.CL·May 26, 2023·51 cites

Open Access · 1 Repo · 3 Datasets

TL;DR

This paper critically examines the effectiveness of finetuning open-source language models to imitate proprietary models like ChatGPT, revealing significant limitations in capturing true capabilities despite superficial similarities.

Contribution

It provides a comprehensive analysis showing that imitation models fail to replicate the factuality and broader capabilities of proprietary models, emphasizing the need to improve base models instead.

Findings

01. Imitation models perform well in style but poorly in factual accuracy.

02. Crowd raters judge imitation model outputs as competitive with ChatGPT, but automatic evaluations reveal large gaps.

03. Closing the performance gap would require substantially more imitation data or more capable base models.

Abstract

An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models: they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to…

Code & Models

Repositories

IBM/Dromedary
pytorch

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Software Engineering Research

Methods: None · Balanced Selection