Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?
Xiang Li, Jiayi Xin, Qi Long, Weijie J. Su

TL;DR
This paper introduces KnowSum, a statistical framework that estimates the unseen knowledge in LLMs to improve evaluation accuracy, revealing that current assessments often overlook substantial internal knowledge and can misrank models.
Contribution
The paper presents KnowSum, a novel method to quantify unseen knowledge in LLMs, enhancing evaluation methods by accounting for unobserved information.
Findings
Significant unseen knowledge is omitted in current evaluations.
KnowSum provides more accurate estimates of total LLM knowledge.
Model rankings change notably when unseen knowledge is considered.
Abstract
Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this "evaluation crisis" is the oversight of unseen knowledge: information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information…
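To make the idea of "extrapolating from the appearance frequencies of observed knowledge instances" concrete, here is a minimal sketch using the classical Good-Toulmin estimator from the unseen-species literature. This is an illustrative assumption, not necessarily the estimator KnowSum uses; the function names, the `t` parameter, and the toy list of theorem names are all hypothetical.

```python
from collections import Counter


def frequency_counts(observations):
    """Return a mapping s -> number of distinct items observed exactly s times."""
    per_item = Counter(observations)      # item -> how often it appeared
    return Counter(per_item.values())     # count s -> how many items have that count


def good_toulmin_unseen(observations, t=1.0):
    """Classical Good-Toulmin estimate of how many NEW distinct items would
    appear if we drew t * n further samples, where n = len(observations).

    The plain (unsmoothed) series is only well behaved for 0 <= t <= 1,
    so larger t is rejected here rather than silently extrapolated.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("this unsmoothed sketch supports 0 <= t <= 1 only")
    prevalence = frequency_counts(observations)
    # U(t) = -sum_{s >= 1} (-t)^s * N_s, where N_s = items seen exactly s times
    return -sum(((-t) ** s) * n_s for s, n_s in prevalence.items())


# Hypothetical toy data: theorem names elicited from a model across repeated queries.
sample = [
    "pythagoras", "pythagoras", "pythagoras",
    "fermat", "fermat",
    "bayes", "rolle", "stokes",
]

seen = len(set(sample))
estimated_new = good_toulmin_unseen(sample, t=1.0)
print(f"distinct theorems seen: {seen}")
print(f"estimated new theorems in an equally large second sample: {estimated_new}")
```

The sketch relies only on how many items were seen once, twice, and so on: many singletons suggest that much remains unobserved, while a sample dominated by repeats suggests the observed set is close to complete.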
Taxonomy
Topics: Library Science and Information Systems · Artificial Intelligence in Law
