The False Promise of Imitating Proprietary LLMs
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

TL;DR
This paper critically examines finetuning open-source language models to imitate proprietary models such as ChatGPT, and finds that the imitation models copy the proprietary model's style while capturing little of its underlying capability.
Contribution
The paper shows that imitation models fail to replicate the factuality and broader capabilities of proprietary models, and argues that improving base models matters more than collecting imitation data.
Findings
Imitation models perform well in style but poorly in factual accuracy.
Crowd workers rate imitation model outputs as competitive with ChatGPT, but targeted automatic evaluations reveal large gaps.
Closing the performance gap would require substantially more imitation data or a more capable base model.
Abstract
An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models: they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to…
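As a rough illustration of the setup the abstract describes, the sketch below pairs prompts with outputs recorded from a stronger model and trims the resulting dataset to a token budget, mirroring the idea of varying the amount of imitation data. Everything here is an assumption made for illustration: the `ImitationExample` class, the prompt template, the whitespace token count, and the `take_token_budget` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ImitationExample:
    prompt: str
    response: str  # output recorded from the stronger (teacher) model

    def to_text(self) -> str:
        # Hypothetical template; the paper's actual format is not given here.
        return f"### Prompt:\n{self.prompt}\n### Response:\n{self.response}"


def count_tokens(text: str) -> int:
    # Crude whitespace count standing in for a real tokenizer (assumption).
    return len(text.split())


def take_token_budget(examples, budget):
    """Keep examples in order until the next one would exceed `budget` tokens."""
    kept, used = [], 0
    for ex in examples:
        n = count_tokens(ex.to_text())
        if used + n > budget:
            break
        kept.append(ex)
        used += n
    return kept, used


# Toy prompt/response pairs, invented for this sketch.
examples = [
    ImitationExample("Name a prime number.", "Seven is a prime number."),
    ImitationExample("What is 2 + 2?", "2 + 2 equals 4."),
    ImitationExample("Define recursion.", "Recursion is when a function calls itself."),
]

# Build finetuning subsets of increasing size from the same pool.
for budget in (20, 30, 100):
    subset, used = take_token_budget(examples, budget)
    print(budget, len(subset), used)
```

A real pipeline would swap in the base model's tokenizer and pass each subset to a finetuning run; the point of the sketch is only that the data amount is a controlled variable alongside base model size.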
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Software Engineering Research
Methods: None · Balanced Selection
