Language Models (Mostly) Know What They Know

Saurav Kadavath; Tom Conerly; Amanda Askell; Tom Henighan; Dawn Drain; Ethan Perez; Nicholas Schiefer; Zac Hatfield-Dodds; Nova DasSarma; Eli Tran-Johnson; Scott Johnston; Sheer El-Showk; Andy Jones; Nelson Elhage; Tristan Hume; Anna Chen; Yuntao Bai; Sam Bowman; Stanislav Fort; Deep Ganguli; Danny Hernandez; Josh Jacobson; Jackson Kernion; Shauna Kravec; Liane Lovitt; Kamal Ndousse; Catherine Olsson; Sam Ringer; Dario Amodei; Tom Brown; Jack Clark; Nicholas Joseph; Ben Mann; Sam McCandlish; Chris Olah; Jared Kaplan

arXiv:2207.05221 · cs.CL · November 22, 2022 · 161 cites

TL;DR

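This paper studies whether large language models can judge the correctness of their own answers and predict which questions they will be able to answer. Larger models are well calibrated on multiple choice and true/false questions when these are suitably formatted, and self-evaluation via P(True) shows encouraging performance, calibration, and scaling.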

Contribution

The paper shows that language models can estimate P(True), the probability that their own proposed answers are correct, and investigates whether they can be trained to predict P(IK), the probability that they know the answer to a question.

Findings

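01. Larger models are well calibrated on diverse multiple choice and true/false questions when the questions are provided in the right format.
02. Self-evaluation improves when models consider many of their own samples before judging one specific answer.
03. Models can be trained to predict P(IK), the probability that they know the answer to a question.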
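The calibration claim in the findings can be made concrete with a small sketch. Expected calibration error (ECE) is one common way to score calibration; the summary above does not say which metric the authors report, so treat the choice of ECE, the function name, and the 10-bin default as assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted gap between confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] that each answer is right.
    correct: booleans, True where the answer was actually right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width bins by confidence.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, c))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0: if it assigns 0.9 confidence to ten answers and nine are right, confidence and accuracy match. Larger values mean stated confidence and actual accuracy diverge.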

Abstract

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to…
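The propose-then-evaluate procedure in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_self_eval_prompt` and `p_true`, and the idea of normalizing over exactly two answer options are all assumptions. The log-probabilities would come from whatever model is being evaluated; no model API is called here.

```python
import math


def build_self_eval_prompt(question, proposed_answer):
    """Format a true/false self-evaluation query (hypothetical wording)."""
    return (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )


def p_true(logprob_true, logprob_false):
    """Normalize two option log-probabilities into a probability of 'True'.

    Assumes the model's log-probabilities for the 'True' and 'False' options
    have already been obtained. Subtracting the max before exponentiating
    keeps the computation numerically stable.
    """
    m = max(logprob_true, logprob_false)
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)
```

For example, if a model puts probability 0.6 on the "True" option and 0.2 on "False" (with the rest on other tokens), `p_true(math.log(0.6), math.log(0.2))` renormalizes over the two options to give 0.75.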

Code & Models

Repositories

iinemo/lm-polygraph (PyTorch)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification