KoLA: Carefully Benchmarking World Knowledge of Large Language Models
Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan

TL;DR
This paper introduces KoLA, a comprehensive benchmark for evaluating large language models' world knowledge, emphasizing meticulous design, diverse data sources, and innovative evaluation metrics to ensure fair and insightful assessments.
Contribution
We developed KoLA, a knowledge-oriented benchmark with a four-level ability taxonomy, diverse data sources including emerging corpora, and novel evaluation metrics for thorough LLM assessment.
Findings
28 LLMs evaluated, revealing varied knowledge capabilities
KoLA's contrastive metrics provide nuanced insights into LLM knowledge creation
Benchmark and leaderboard publicly available for ongoing research
Abstract
The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system,…
Peer Reviews
Decision · ICLR 2024 poster
The tools and data from the paper will be released upon acceptance. The community will benefit from such a large-scale analysis of knowledge-related abilities with known and evolving sources. The presented analysis is already very insightful. Breaking the task down by knowledge-related abilities, with known and evolving sources, is a compelling way to assess LLMs' evolving capabilities.
I don’t see any major weaknesses in the paper. One could argue that it is just an analysis paper and that some of the insights drawn here might not be novel. But I feel that the presented framework will be valuable to the community. The authors have done a very good job explaining the framework in detail.
- The knowledge benchmark fills the gap in thoroughly evaluating the world knowledge of current large language models. The taxonomy is carefully designed, and rich experiments on model choices are conducted.
- The ever-evolving setup has long-term benefit for studying the generalization problem in knowledge-intensive tasks.
- The self-contrast metric is well motivated as a way to account for hallucination in knowledge-based generation.
- The annotation team has a strong educational background.
- Rouge-L, as used in Eq. 3, does not seem strong enough to capture knowledge association, especially when T is a free-generation result. Replacing it with a model-based measure (e.g. an entailment model) might work better, at additional computational cost.
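The concern about Rouge-L can be illustrated with a small sketch. Rouge-L scores two texts by the longest common subsequence of their tokens, so it rewards surface overlap rather than agreement on facts. The code below is a minimal, whitespace-tokenized F1 variant written for illustration only; it is not the paper's Eq. 3 or its implementation, and the function names and example sentences are made up here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L style F1 over lowercased, whitespace-split tokens (illustrative)."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


# Toy sentences (hypothetical, not from the benchmark).
reference = "the treaty was signed in paris in 1898"
paraphrase = "in 1898 both sides concluded the agreement in the french capital"
contradiction = "the treaty was not signed in paris in 1898"

print("paraphrase   :", round(rouge_l_f1(paraphrase, reference), 3))
print("contradiction:", round(rouge_l_f1(contradiction, reference), 3))
```

In this toy case the contradicting sentence scores far higher than the faithful paraphrase, which is the kind of mismatch an entailment-based measure is meant to avoid.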
S1. The paper presents a new LLM benchmark with some innovations, including constructing the benchmark from emerging corpora and evaluating models' knowledge creation capability.
S2. The paper evaluates major SOTA LLMs, providing a good comparison of model capabilities from different perspectives.
S3. The paper reads well and is easy to follow.
W1. Benchmarking on emerging corpora is a great idea, and it is encouraging that the authors promise to refresh the benchmark regularly. However, it is not clear how such a benchmark will be maintained in the long term.
W2. It is not clear why another LLM benchmark is needed. Given the many benchmarks already publicly available, I am not convinced KoLA is a must-have addition.
W3. It is not clear why the standardized overall scoring gives a better picture than simple ranking.
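One way to make the W3 question concrete: a rank keeps only the order of models, while a standardized score also keeps the size of the gaps between them. The sketch below assumes that standardization means a z-score across models followed by min-max rescaling to the range 0 to 100. That formula is an assumption for illustration, since this summary does not state the paper's exact scoring, and the model names and raw scores are invented toy values.

```python
from statistics import mean, pstdev


def z_scores(raw):
    """Z-score each model's raw score across all models (assumed scheme)."""
    values = list(raw.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {name: 0.0 for name in raw}
    return {name: (score - mu) / sigma for name, score in raw.items()}


def rescale_0_100(z):
    """Min-max rescale z-scores so the worst model gets 0 and the best 100."""
    lo, hi = min(z.values()), max(z.values())
    if hi == lo:
        return {name: 50.0 for name in z}
    return {name: 100 * (v - lo) / (hi - lo) for name, v in z.items()}


def ranks(raw):
    """Simple ranking: 1 is best. Ties are broken by dictionary order."""
    ordered = sorted(raw, key=raw.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


# Hypothetical raw task scores for three hypothetical models.
raw = {"model_a": 80.0, "model_b": 79.0, "model_c": 40.0}

standardized = rescale_0_100(z_scores(raw))
for name in raw:
    print(name, "rank", ranks(raw)[name], "standardized", round(standardized[name], 1))
```

Here model_a and model_b are adjacent in rank just as model_b and model_c are, but the standardized scores show the first pair nearly tied and the second pair far apart. Whether that extra information justifies the added complexity is the reviewer's open question.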
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
