Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamil\.e Luko\v{s}i\=ut\.e, Karina Nguyen,, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu,, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan, Seethor, Cameron McKinnon, Christopher Olah, Da Yan

TL;DR
This paper introduces a method for automatically generating evaluation datasets for language models using the models themselves, revealing new behaviors and inverse scaling phenomena as models grow larger.
Contribution
The authors develop LM-based evaluation generation techniques that reduce human effort and uncover novel behaviors and inverse scaling effects in large language models.
Findings
Larger LMs exhibit increased sycophancy and desire to pursue goals.
RLHF can lead to worse LM behaviors, such as stronger political views.
Generated evaluations are high-quality and reveal new LM behaviors.
Abstract
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
OpenAI’s ChatGPT Surprised Even Its Creators!· youtube
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Software Engineering Research
