Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?

Xiang Li; Jiayi Xin; Qi Long; Weijie J. Su

arXiv:2506.02058·cs.CL·June 4, 2025

Open Access

TL;DR

This paper introduces KnowSum, a statistical framework that estimates the unseen knowledge in LLMs to improve evaluation accuracy, revealing that current assessments often overlook substantial internal knowledge and can misrank models.

Contribution

The paper presents KnowSum, a novel method to quantify unseen knowledge in LLMs, enhancing evaluation methods by accounting for unobserved information.

Findings

01. Significant unseen knowledge is omitted in current evaluations.

02. KnowSum provides more accurate estimates of total LLM knowledge.

03. Model rankings change notably when unseen knowledge is considered.

Abstract

Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often reflect the actual capacities of these models inconsistently. In this paper, we demonstrate that one of many contributing factors to this "evaluation crisis" is the oversight of unseen knowledge: information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information…
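To make the idea of "extrapolating from appearance frequencies" concrete, here is a minimal sketch using two classical unseen-species statistics: the Good-Turing missing-mass estimate and the Chao1 lower bound. This is an illustration only. The abstract as shown here does not say which estimator KnowSum uses, so the choice of estimators, the function names, and the toy list of theorem names below are all assumptions and should not be read as the authors' method.

```python
from collections import Counter


def frequency_counts(observations):
    """Map k -> number of distinct items that were observed exactly k times."""
    per_item = Counter(observations)      # item -> times it appeared
    return Counter(per_item.values())     # k -> how many items appeared k times


def good_turing_missing_mass(observations):
    """Good-Turing estimate of the probability that the next draw is a new item.

    Illustrative stand-in (assumption): share of draws that are singletons.
    """
    n = len(observations)
    if n == 0:
        return 0.0
    f1 = frequency_counts(observations).get(1, 0)
    return f1 / n


def chao1_unseen(observations):
    """Chao1 lower-bound estimate of how many distinct items were never observed.

    Illustrative stand-in (assumption), based on singletons f1 and doubletons f2.
    """
    f = frequency_counts(observations)
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 > 0:
        return f1 * f1 / (2 * f2)
    # Bias-corrected form used when no item was seen exactly twice.
    return f1 * (f1 - 1) / 2


# Hypothetical toy data: theorem names extracted from repeated model outputs.
samples = [
    "pythagoras", "pythagoras", "fermat",
    "bayes", "bayes", "bayes",
    "rolle", "taylor", "taylor",
]
observed = len(set(samples))
unseen = chao1_unseen(samples)
estimated_total = observed + unseen
```

In this toy sample, two items appear once and two appear twice, so the Chao1 sketch estimates one additional unseen item beyond the five observed. The general intuition is that many rarely seen items suggest that many more remain unobserved.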

Taxonomy

Topics: Library Science and Information Systems · Artificial Intelligence in Law