Measuring Progress on Scalable Oversight for Large Language Models

Samuel R. Bowman; Jeeyoon Hyun; Ethan Perez; Edwin Chen; Craig Pettit; Scott Heiner; Kamilė Lukošiūtė; Amanda Askell; Andy Jones; Anna Chen; Anna Goldie; Azalia Mirhoseini; Cameron McKinnon; Christopher Olah; Daniela Amodei; Dario Amodei; Dawn Drain; Dustin Li; Eli Tran-Johnson; Jackson Kernion; Jamie Kerr; Jared Mueller; Jeffrey Ladish; Joshua Landau; Kamal Ndousse; Liane Lovitt; Nelson Elhage; Nicholas Schiefer; Nicholas Joseph; Noemí Mercado; Nova DasSarma; Robin Larson; Sam McCandlish; Sandipan Kundu; Scott Johnston; Shauna Kravec; Sheer El Showk; Stanislav Fort; Timothy Telleen-Lawton; Tom Brown; Tom Henighan; Tristan Hume; Yuntao Bai; Zac Hatfield-Dodds; Ben Mann; Jared Kaplan

arXiv:2211.03540 · cs.HC · November 15, 2022 · 31 citations


Open Access

TL;DR

This paper proposes an experimental design for studying scalable oversight of large language models, centered on tasks that human specialists can solve but that unaided humans and current general AI systems cannot. It also reports a proof-of-concept experiment on two question-answering tasks, MMLU and time-limited QuALITY.

Contribution

The paper introduces an experimental framework for studying scalable oversight and reports proof-of-concept results in which human-AI collaboration improves performance on two question-answering tasks.

Findings

01. Humans working with an AI assistant outperform both the AI alone and unaided humans on the two tasks studied, MMLU and time-limited QuALITY.

02. Chat-based interaction with a language model can improve human performance.

03. Scalable oversight research is feasible with current large language models.

Abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable…


Videos

This is what happens when you let AIs debate · YouTube

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Speech and dialogue systems
