Statistical Knowledge Assessment for Large Language Models
Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Zhifang Sui, Lei Li

TL;DR
This paper introduces KaRR, a statistical method that quantifies factual knowledge in large language models by estimating how much more likely a model is to generate the correct answer across diverse prompts than by random chance. Its scores correlate well with human judgment.
Contribution
The paper presents KaRR, a novel statistical approach for assessing factual knowledge in LLMs, along with a comprehensive evaluation suite and analysis of model scaling and tuning effects.
Findings
KaRR correlates strongly (0.43 Kendall's τ) with human assessments.
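Among LLMs that share a backbone architecture, larger models hold more factual knowledge, consistent with scaling laws.
Tuning on instruction-following data may reduce how reliably an LLM generates factually correct text.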
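The correlation figure in the first finding is a rank correlation. As a rough illustration of what Kendall's τ measures, the sketch below computes the simple tau-a variant in pure Python. The score lists are hypothetical values invented for this example, not data from the paper, and the paper may use a different variant of the statistic.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both rankings order the pair the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five facts (illustration only).
metric_scores = [0.9, 0.2, 0.6, 0.4, 0.8]  # automatic knowledge scores
human_scores = [5, 1, 2, 3, 4]             # human ratings of the same facts

tau = kendall_tau_a(metric_scores, human_scores)
print(f"Kendall's tau-a = {tau:.2f}")
```

A value of 1 means the two rankings agree on every pair, 0 means no association, and -1 means they disagree on every pair.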
Abstract
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? Existing LLMs may generate distinct responses for different prompts. In this paper, we study the problem of quantifying knowledge contained in an LLM regarding a given set of facts. We propose KaRR, a statistical approach to assess factual knowledge for LLMs. The main idea is to estimate the ratio of LLM generating text corresponding to the answer entity given diverse prompts of the subject and the querying relation, versus it generating by random chances. Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's τ) with the…
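The ratio idea described in the abstract can be illustrated with a toy calculation. The sketch below is a deliberately simplified stand-in, not the KaRR estimator defined in the paper: it assumes we already have the model's probability of producing the answer under each of several prompts, plus a single chance-level probability, and divides the prompt average by that baseline. The function name and all numbers are hypothetical.

```python
from statistics import fmean


def answer_likelihood_ratio(prompt_probs, chance_prob):
    """Toy ratio: mean answer probability across prompts / chance probability.

    prompt_probs: probability the model assigns to the answer under each
                  prompt paraphrase (hypothetical inputs).
    chance_prob:  probability of producing the answer without the relevant
                  prompt (hypothetical baseline).
    """
    if not prompt_probs:
        raise ValueError("prompt_probs must not be empty")
    if not 0 < chance_prob <= 1:
        raise ValueError("chance_prob must be in (0, 1]")
    return fmean(prompt_probs) / chance_prob


# Three paraphrased prompts for one fact; values are invented.
probs = [0.30, 0.10, 0.20]
ratio = answer_likelihood_ratio(probs, chance_prob=0.02)
print(f"answer is about {ratio:.1f}x more likely than chance")
```

A ratio well above 1 suggests the prompts raise the likelihood of the answer beyond chance, while a ratio near 1 suggests the model is no better than chance. Averaging over paraphrases is what keeps the score from depending on any single prompt wording.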
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
