KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Jifan Yu; Xiaozhi Wang; Shangqing Tu; Shulin Cao; Daniel Zhang-Li; Xin Lv; Hao Peng; Zijun Yao; Xiaohan Zhang; Hanming Li; Chunyang Li; Zheyuan Zhang; Yushi Bai; Yantao Liu; Amy Xin; Nianyi Lin; Kaifeng Yun; Linlu Gong; Jianhui Chen; Zhili Wu; Yunjia Qi; Weikai Li; Yong Guan; Kaisheng Zeng; Ji Qi; Hailong Jin; Jinxin Liu; Yu Gu; Yuan Yao; Ning Ding; Lei Hou; Zhiyuan Liu; Bin Xu; Jie Tang; Juanzi Li

arXiv:2306.09296 · cs.CL · July 2, 2024 · 24 cites



TL;DR

This paper introduces KoLA, a comprehensive benchmark for evaluating large language models' world knowledge, emphasizing meticulous design, diverse data sources, and innovative evaluation metrics to ensure fair and insightful assessments.

Contribution

We developed KoLA, a knowledge-oriented benchmark with a four-level ability taxonomy, diverse data sources including emerging corpora, and novel evaluation metrics for thorough LLM assessment.

Findings

01. 28 LLMs evaluated, revealing varied knowledge capabilities

02. KoLA's contrastive metrics provide nuanced insights into LLM knowledge creation

03. Benchmark and leaderboard publicly available for ongoing research

Abstract

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system,…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 · accept, good paper · Confidence 4

Strengths

The tools and data from the paper will be released upon acceptance. The community will benefit from such a large scale analysis over knowledge-related abilities with known and evolving sources. The presented analysis is already very insightful. The breakdown of the task using knowledge-related abilities with known and evolving sources is compelling to assess LLMs evolving capabilities.

Weaknesses

I don’t see any major weaknesses in the paper. One could argue that it is just an analysis paper and that some of the insights drawn here might not be novel. But I feel that the presented framework will be valuable to the community. The authors have done a very good job explaining the framework in detail.

Reviewer 02 · Rating 8 · accept, good paper · Confidence 2

Strengths

- The knowledge benchmark fills the blank of thoroughly evaluating world knowledge of current large language models. The taxonomy is carefully designed and rich experiments on model choices are conducted.
- The ever evolving setup has long-term benefit in considering the generalization problem in knowledge-intensive tasks.
- The self-contrast metric has a good motivation in balancing the hallucination in knowledge-based generation.
- The annotation team is of strong educational background.

Weaknesses

- The knowledge-wise strength of Rouge-L used in Eq. 3 doesn't look to be strong enough in capturing the knowledge association, especially when T is a free-generation result. Maybe replacing the measurement with another model (e.g. an entailment model) would be better, at additional computational cost?
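
To make the reviewer's concern concrete, here is a minimal sketch of ROUGE-L as a token-level longest-common-subsequence F1. It is not the paper's Eq. 3, which is not reproduced on this page; the function names, the whitespace tokenization, and the example sentences are all assumptions made for illustration. The example shows the failure mode the reviewer hints at: a factually wrong answer that copies the reference wording can outscore a correct paraphrase.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased, whitespace-split tokens (a simplification)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, chosen only to illustrate lexical vs. factual overlap.
reference = "Marie Curie won the Nobel Prize in Physics in 1903"
wrong_but_similar = "Marie Curie won the Nobel Prize in Physics in 1911"
correct_paraphrase = "In 1903 the physics Nobel went to Marie Curie"

score_wrong = rouge_l_f1(wrong_but_similar, reference)
score_paraphrase = rouge_l_f1(correct_paraphrase, reference)
print(score_wrong > score_paraphrase)  # lexical overlap rewards the wrong date
```

An entailment model, as the reviewer suggests, would compare meaning instead of surface order, at the price of an extra model call per pair.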

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

S1. The paper presents a new LLM benchmark with some innovations, including constructing the benchmark with emerging corpora and evaluating models' knowledge creation capability.
S2. The paper evaluated major SOTA LLMs, providing a good comparison of models' capabilities from different perspectives.
S3. The paper reads well and is easy to follow.

Weaknesses

W1. Benchmarking on emerging corpora is a great idea, and it is quite encouraging to see the authors promise to refresh the benchmark regularly. However, it is not clear how to maintain such a benchmark in the long term.
W2. It is not clear why we need another new LLM benchmark. Given all the different benchmarks available publicly, I am not convinced KoLA is a must-have addition.
W3. It is not clear why the standardized overall scoring can give a better idea than simple ranking.
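
One way to read the question in W3 is with a small numerical sketch. The code below assumes that "standardized" means a per-task z-score across models; this page does not give KoLA's exact formula, so treat that as an assumption. The model names and scores are hypothetical. The sketch shows that two models can tie on average rank while standardized scores still separate them, because z-scores keep the size of each gap and ranks do not.

```python
from statistics import mean, pstdev


def standardized_scores(raw):
    """Map {model: raw score} for one task to {model: z-score across models}."""
    mu = mean(raw.values())
    sigma = pstdev(raw.values())
    if sigma == 0:
        return {model: 0.0 for model in raw}
    return {model: (score - mu) / sigma for model, score in raw.items()}


def ranks(raw):
    """Map {model: raw score} to {model: rank}, where 1 is the best score."""
    ordered = sorted(raw, key=raw.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def average(per_task):
    """Average a list of {model: value} dicts model by model."""
    return {m: mean(task[m] for task in per_task) for m in per_task[0]}


# Hypothetical raw scores on two tasks (not taken from the paper).
task_a = {"model_a": 90.0, "model_b": 89.0, "model_c": 50.0}
task_b = {"model_a": 60.0, "model_b": 80.0, "model_c": 59.0}

mean_rank = average([ranks(task_a), ranks(task_b)])
mean_z = average([standardized_scores(task_a), standardized_scores(task_b)])

# model_a and model_b swap first and second place, so their mean ranks tie,
# but model_b loses task A by 1 point and wins task B by 20 points.
print(mean_rank["model_a"] == mean_rank["model_b"])
print(mean_z["model_b"] > mean_z["model_a"])
```

Whether this extra resolution is worth the added complexity is the judgment call the reviewer is asking the authors to defend.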

Code & Models

Repositories

thu-keg/kola · Official


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification