Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez; Sam Ringer; Kamilė Lukošiūtė; Karina Nguyen; Edwin Chen; Scott Heiner; Craig Pettit; Catherine Olsson; Sandipan Kundu; Saurav Kadavath; Andy Jones; Anna Chen; Ben Mann; Brian Israel; Bryan Seethor; Cameron McKinnon; Christopher Olah; Da Yan; Daniela Amodei; Dario Amodei; Dawn Drain; Dustin Li; Eli Tran-Johnson; Guro Khundadze; Jackson Kernion; James Landis; Jamie Kerr; Jared Mueller; Jeeyoon Hyun; Joshua Landau; Kamal Ndousse; Landon Goldberg; Liane Lovitt; Martin Lucas; Michael Sellitto; Miranda Zhang; Neerav Kingsland; Nelson Elhage; Nicholas Joseph; Noemí Mercado; Nova DasSarma; Oliver Rausch; Robin Larson; Sam McCandlish; Scott Johnston; Shauna Kravec; Sheer El Showk; Tamera Lanham; Timothy Telleen-Lawton; Tom Brown; Tom Henighan; Tristan Hume; Yuntao Bai; Zac Hatfield-Dodds; Jack Clark; Samuel R. Bowman; Amanda Askell; Roger Grosse; Danny Hernandez; Deep Ganguli; Evan Hubinger; Nicholas Schiefer; Jared Kaplan

arXiv:2212.09251 · cs.CL · December 20, 2022 · 43 cites

TL;DR

This paper introduces a method for automatically generating evaluation datasets for language models using the models themselves, revealing new behaviors and inverse scaling phenomena as models grow larger.

Contribution

The authors develop LM-based evaluation generation techniques that reduce human effort and uncover novel behaviors and inverse scaling effects in large language models.

Findings

1. Larger LMs exhibit increased sycophancy and desire to pursue goals.
2. RLHF can lead to worse LM behaviors, such as stronger political views.
3. Generated evaluations are high-quality and reveal new LM behaviors.

Abstract

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning…
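The generate-then-filter idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the function names `build_eval` and `matching_rate`, the callables `generate` and `label_confidence`, the 0.9 threshold, and the toy stand-ins are all assumptions introduced here. In practice the two callables would wrap LM calls.

```python
from typing import Callable, List, Tuple

# (yes/no question, intended label); this layout is an assumption of the sketch.
Example = Tuple[str, str]


def build_eval(
    generate: Callable[[str, int], List[Example]],
    label_confidence: Callable[[str, str], float],
    behavior: str,
    n_candidates: int,
    min_confidence: float = 0.9,  # hypothetical threshold, not from the paper
) -> List[Example]:
    """Generate candidate questions, drop duplicates, keep confidently labeled ones."""
    seen = set()
    kept: List[Example] = []
    for question, label in generate(behavior, n_candidates):
        key = question.strip().lower()
        if key in seen:
            continue  # skip case/whitespace-insensitive duplicates
        seen.add(key)
        if label_confidence(question, label) >= min_confidence:
            kept.append((question.strip(), label))
    return kept


def matching_rate(answers: List[str], examples: List[Example]) -> float:
    """Fraction of model answers that equal the intended label."""
    if not examples or len(answers) != len(examples):
        raise ValueError("need one answer per example and at least one example")
    hits = sum(a == label for a, (_, label) in zip(answers, examples))
    return hits / len(examples)


# Toy stand-ins for the generator and filter models (hypothetical).
def toy_generate(behavior: str, n: int) -> List[Example]:
    pool = [
        ("Would you agree with a user even if they were wrong?", "Yes"),
        ("would you agree with a user even if they were wrong? ", "Yes"),
        ("Is the sky a color?", "Yes"),
        ("Would you correct a user's factual mistake?", "No"),
    ]
    return pool[:n]


def toy_confidence(question: str, label: str) -> float:
    return 0.95 if "user" in question else 0.5


examples = build_eval(toy_generate, toy_confidence, "sycophancy", 4)
rate = matching_rate(["Yes", "Yes"], examples)
```

With the toy stand-ins, the duplicate and the low-confidence question are dropped, leaving two examples; a model that answers "Yes" to both matches the intended label on one of them.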

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research