A large-scale evaluation of commonsense knowledge in humans and large language models
Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting

TL;DR
This paper evaluates the commonsense knowledge of large language models by comparing their judgments with those of a diverse human population. It finds that most models score below the human median and correlate only modestly with human judgments, and that smaller, open models outperform larger, proprietary ones.
Contribution
It introduces a novel evaluation framework that accounts for human heterogeneity, assessing LLMs' commonsense knowledge relative to diverse human populations.
Findings
Most LLMs score below the human median in commonsense competence.
LLMs show only modest correlation with human judgments.
Smaller, open models outperform larger, proprietary models in this evaluation.
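To make the correlation finding concrete, here is a minimal sketch of one way such a comparison could be computed. It is not the paper's procedure: it assumes binary agree/disagree ratings, computes the share of raters agreeing with each statement in two groups, and takes the Pearson correlation of those shares. The function names `agreement_rates` and `pearson` and the toy data are hypothetical.

```python
from math import sqrt


def agreement_rates(ratings):
    """Per-statement share of raters who answered 1 (agree).

    ratings: list of raters, each a list of 0/1 answers, one per statement.
    """
    n_raters = len(ratings)
    n_statements = len(ratings[0])
    return [sum(r[j] for r in ratings) / n_raters for j in range(n_statements)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data (invented for illustration): 2 human raters, 2 model runs, 3 statements.
human_ratings = [[1, 1, 0], [1, 0, 0]]
model_ratings = [[1, 1, 0], [1, 1, 0]]

r = pearson(agreement_rates(human_ratings), agreement_rates(model_ratings))
print(round(r, 3))
```

A value near 1 would mean the two groups find the same statements widely agreed upon; a modest value means their patterns of agreement only partly overlap.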
Abstract
Commonsense knowledge, a major constituent of artificial intelligence (AI), is primarily evaluated in practice by human-prescribed ground-truth labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a method for assessing commonsense knowledge in AI, specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in…
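The idea of treating a model as one more survey respondent and placing it relative to the human median can be sketched as follows. This is a simplified illustration, not the paper's metric: it assumes binary ratings, scores each respondent by the fraction of statements on which they match the human majority, and compares the model's score with the median human score. All names and data here are hypothetical.

```python
from statistics import median


def majority_labels(human_ratings):
    """Majority answer (0 or 1) per statement; ties are resolved to 1."""
    n_humans = len(human_ratings)
    n_statements = len(human_ratings[0])
    labels = []
    for j in range(n_statements):
        votes = sum(r[j] for r in human_ratings)
        labels.append(1 if 2 * votes >= n_humans else 0)
    return labels


def agreement_score(ratings, labels):
    """Fraction of statements on which one respondent matches the majority."""
    return sum(int(a == b) for a, b in zip(ratings, labels)) / len(labels)


def compare_to_human_median(model_ratings, human_ratings):
    """Return (model score, median human score).

    Simplification: each human also counts toward the majority they are
    scored against, which slightly inflates human scores.
    """
    labels = majority_labels(human_ratings)
    human_scores = [agreement_score(r, labels) for r in human_ratings]
    return agreement_score(model_ratings, labels), median(human_scores)


# Toy data (invented for illustration): 3 humans, 4 statements.
humans = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
model = [1, 0, 1, 1]

model_score, human_median = compare_to_human_median(model, humans)
print(model_score, human_median)
```

Under this toy setup the model falls below the human median, which is the kind of outcome the abstract reports for most LLMs.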
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
