Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning
Elizaveta Reganova, Peter Steinbach

TL;DR
This paper evaluates the certainty and accuracy of large language models in answering physics questions, revealing that models are often accurate when certain but exhibit complex uncertainty patterns, especially with reasoning tasks.
Contribution
It introduces a method to analyze the relationship between uncertainty and accuracy in LLMs on physics questions, highlighting differences between knowledge retrieval and reasoning tasks.
Findings
Most models are accurate when they are certain about their answers, though this does not hold in general.
The relationship between uncertainty and accuracy forms a broad horizontal bell-shaped distribution.
As questions require more reasoning, the asymmetry between accuracy and uncertainty increases.
Abstract
Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models tend to "hallucinate" their responses, which makes their performance hard to evaluate. A major challenge is determining how to assess the certainty of a model's predictions and how that certainty correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open-source LLMs, as well as GPT-3.5 Turbo, on multiple-choice physics questionnaires. We focus on the relationship between answer accuracy and variability in topics related to physics. Our findings suggest that most models provide accurate replies in cases where they are certain, but this is far from a general behavior. The relationship between accuracy and uncertainty exhibits a broad horizontal bell-shaped distribution. We…
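The abstract relates answer accuracy to answer variability but, in this excerpt, does not say how variability is measured. The following is a minimal sketch under one possible assumption: each multiple-choice question is asked several times, uncertainty is taken as the normalized Shannon entropy of the sampled answers, and accuracy is the fraction of samples that match the correct option. The function names, the four-option default, and the sample data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers, n_options=4):
    """Normalized Shannon entropy of repeated answers to one question.

    0.0 means the model always picked the same option;
    1.0 means its picks were spread uniformly over all options.
    Assumption: entropy is used as the uncertainty measure here;
    the paper may define variability differently.
    """
    total = len(answers)
    counts = Counter(answers)
    # Sum p * log(1/p) over the options that were actually chosen.
    h = sum((c / total) * math.log(total / c) for c in counts.values())
    return h / math.log(n_options)


def accuracy(answers, correct):
    """Fraction of sampled answers that match the correct option."""
    return sum(a == correct for a in answers) / len(answers)


# Hypothetical samples: question id -> (sampled answers, correct option).
samples = {
    "q1": (["A"] * 10, "A"),                              # consistent and right
    "q2": (["A", "B", "C", "D", "A", "B", "C", "D"], "B"),  # maximally spread
}

# One (uncertainty, accuracy) point per question; a scatter plot of these
# points is one way to inspect the shape of the relationship.
points = {
    qid: (answer_entropy(ans), accuracy(ans, correct))
    for qid, (ans, correct) in samples.items()
}

for qid, (u, acc) in points.items():
    print(f"{qid}: uncertainty={u:.3f} accuracy={acc:.3f}")
```

Plotting accuracy against uncertainty for every question, separately for knowledge-retrieval and reasoning questions, would be one way to look for the bell-shaped pattern and the asymmetry the summary describes.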
Taxonomy
Topics: Topic Modeling
Methods: Focus
