How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency
Danna Zheng, Mirella Lapata, Jeff Z. Pan

TL;DR
This paper critically examines the reliability of large language models as knowledge bases by introducing new evaluation criteria, datasets, and metrics focused on factuality and consistency, revealing significant challenges in their use.
Contribution
It proposes a comprehensive framework for evaluating LLMs as knowledge bases, including the UnseenQA dataset and new metrics for factuality and consistency.
Findings
26 LLMs show significant variability in factuality and consistency
Current evaluation methods are insufficient for assessing LLM reliability
The results highlight the need for more rigorous and holistic evaluation approaches
Abstract
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs), yet current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. In this work, we rethink the requirements for evaluating reliable LLM-as-KB usage and highlight two essential factors: factuality, ensuring accurate responses to seen and unseen knowledge, and consistency, maintaining stable answers to questions about the same knowledge. We introduce UnseenQA, a dataset designed to assess LLM performance on unseen knowledge, and propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score. Our experiments on 26 LLMs reveal several challenges regarding their use as KBs, underscoring the need for more principled and comprehensive evaluation.
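The abstract names two criteria, factuality and consistency, without giving their formulas here. The following is a minimal illustrative sketch of how such quantities could be computed, not the paper's actual metrics. The record fields (`fact_id`, `seen`, `answer`, `gold`), the `UNSURE` abstention label, and both function definitions are assumptions made for this example: factuality is approximated as accuracy on seen knowledge plus the abstention rate on unseen knowledge, and consistency as the share of facts whose rephrased questions all receive the same answer.

```python
from collections import defaultdict

# Hypothetical abstention label; the paper's own answer format may differ.
UNSURE = "unsure"

def factuality(records):
    """Return (accuracy on seen facts, abstention rate on unseen facts).

    Assumption: for unseen knowledge, the only acceptable response is to abstain.
    """
    seen = [r for r in records if r["seen"]]
    unseen = [r for r in records if not r["seen"]]
    seen_correct = (
        sum(r["answer"] == r["gold"] for r in seen) / len(seen) if seen else 0.0
    )
    unseen_abstain = (
        sum(r["answer"] == UNSURE for r in unseen) / len(unseen) if unseen else 0.0
    )
    return seen_correct, unseen_abstain

def consistency(records):
    """Return the fraction of facts whose questions all got one identical answer."""
    answers_by_fact = defaultdict(set)
    for r in records:
        answers_by_fact[r["fact_id"]].add(r["answer"])
    if not answers_by_fact:
        return 0.0
    stable = sum(len(answers) == 1 for answers in answers_by_fact.values())
    return stable / len(answers_by_fact)

# Toy data: two questions per fact; f3 stands in for unseen knowledge.
records = [
    {"fact_id": "f1", "seen": True, "answer": "Paris", "gold": "Paris"},
    {"fact_id": "f1", "seen": True, "answer": "Paris", "gold": "Paris"},
    {"fact_id": "f2", "seen": True, "answer": "1990", "gold": "1989"},
    {"fact_id": "f2", "seen": True, "answer": "1989", "gold": "1989"},
    {"fact_id": "f3", "seen": False, "answer": UNSURE, "gold": None},
    {"fact_id": "f3", "seen": False, "answer": "2031", "gold": None},
]

print(factuality(records))
print(consistency(records))
```

In this toy data, three of the four seen answers are correct, one of the two unseen answers abstains, and only `f1` is answered identically across its questions. How the paper combines such quantities into its final reliability score is not specified in the abstract, so no combined score is sketched here.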
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Attention Is All You Need · Focus · Cosine Annealing · Linear Layer · Weight Decay · Softmax · Multi-Head Attention · Dense Connections
