Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan; Lin Pan; Chung-Wei Hang; Jiang Guo; Jiarong Jiang; Bonan Min; Patrick Ng; Zhiguo Wang

arXiv:2404.16164·cs.CL·April 26, 2024

TL;DR

This paper introduces FACT-BENCH, a comprehensive benchmark for evaluating LLMs' ability to recall factual knowledge, and shows how instruction tuning, model size, and exemplars affect recall performance.

Contribution

The paper presents FACT-BENCH, a new benchmark for holistic evaluation of LLMs' factual recall, and provides insights into factors affecting knowledge retention and the effects of fine-tuning strategies.

Findings

1. Instruction-tuning reduces factual recall performance.
2. Larger models outperform smaller ones across families.
3. Fine-tuning on known knowledge improves recall.

Abstract

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of…

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations · Law, Economics, and Judicial Systems

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing