Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman

TL;DR
This paper studies whether large language models can assess the correctness of their own answers and predict which questions they can answer. It reports encouraging calibration results, with potential relevance for honest AI systems.
Contribution
It shows that language models can estimate the probability "P(True)" that their own proposed answers are correct, and investigates training them to predict "P(IK)", the probability that they know the answer to a question.
Findings
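Larger models are well-calibrated on diverse multiple choice and true/false questions when the questions are provided in the right format.
Self-evaluation improves when models consider many of their own samples before judging one specific answer.
Models can predict P(IK), the probability that they know the answer to a question.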
Abstract
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to…
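To make "well-calibrated" concrete, here is a minimal sketch of expected calibration error (ECE), a common calibration metric. It assumes we already have, for each proposed answer, a self-reported probability (playing the role of P(True)) and a ground-truth correctness label. The function name, the equal-width binning, and the bin count are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (hypothetical helper, not from the paper).

    confidences: self-reported probabilities in [0, 1], e.g. P(True).
    correct: whether each corresponding answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the data.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0: among answers it rates at 0.9, about 90% are correct. A model that reports 1.0 on every answer but is right only half the time scores 0.5.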
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
