# A large-scale evaluation of commonsense knowledge in humans and large language models

**Authors:** Tuan Dung Nguyen, Duncan J Watts, Mark E Whiting

PMC · DOI: 10.1093/pnasnexus/pgag029 · PNAS Nexus · 2026-02-16

## TL;DR

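This paper evaluates how well large language models capture human commonsense knowledge when disagreement among humans is taken into account. Most models score below the human median as individual respondents and correlate only modestly with human agreement patterns when used to simulate a population; smaller, open-weight models are often more competitive than larger, proprietary ones.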

## Contribution

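The paper introduces an evaluation framework that scores a model's commonsense judgments against those of a human population rather than against designer-prescribed ground-truth labels, so that the empirically observed heterogeneity in what people consider common sense, and its cultural basis, is part of the measurement.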

## Key findings

- Most LLMs perform below the human median in individual commonsense competence.
- LLMs correlate only modestly with human agreement on statements when simulating populations.
- In both settings, smaller, open-weight models are more competitive than larger, proprietary frontier models.
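The first finding compares a model, treated as one survey respondent, against the distribution of human scores. The sketch below illustrates that kind of comparison under simplifying assumptions that are not taken from the paper: judgments are binary, the reference label for each statement is the human majority vote, and a respondent's score is the fraction of statements on which they match that majority. The function names and toy data are hypothetical.

```python
from statistics import median

def majority_labels(ratings):
    """Majority vote per statement; ratings[i][j] is respondent i's 0/1 judgment of statement j."""
    n_respondents = len(ratings)
    n_statements = len(ratings[0])
    labels = []
    for j in range(n_statements):
        yes_votes = sum(r[j] for r in ratings)
        # Ties are resolved in favour of 1 (an arbitrary choice for this sketch).
        labels.append(1 if 2 * yes_votes >= n_respondents else 0)
    return labels

def consensus_score(answers, labels):
    """Fraction of statements on which one respondent matches the majority label."""
    return sum(a == b for a, b in zip(answers, labels)) / len(labels)

# Toy data (hypothetical): 5 human respondents, 4 statements.
humans = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]
model_answers = [1, 0, 1, 1]  # one model, treated as one more respondent

labels = majority_labels(humans)
human_scores = [consensus_score(h, labels) for h in humans]
human_median = median(human_scores)
model_score = consensus_score(model_answers, labels)

print(f"human median = {human_median:.2f}, model = {model_score:.2f}")
```

In this toy setup each human contributes to the majority they are scored against; a leave-one-out majority would remove that small bias.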

## Abstract

Commonsense knowledge, a major constituent of AI, is primarily evaluated in practice by human-prescribed ground-truth labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a method for assessing commonsense knowledge in AI, specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model’s judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense knowledge to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
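The population-level comparison in the abstract can be pictured as follows. This is a minimal sketch, not the paper's procedure: it assumes binary judgments, takes "agreement" on a statement to be the fraction of respondents who endorse it, and uses a plain Pearson correlation between the human and simulated agreement rates. The helper names and toy data are hypothetical.

```python
from math import sqrt

def agreement_rates(ratings):
    """Per-statement fraction of respondents endorsing it; ratings[i][j] is a 0/1 judgment."""
    n_respondents = len(ratings)
    n_statements = len(ratings[0])
    return [sum(r[j] for r in ratings) / n_respondents for j in range(n_statements)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy data (hypothetical): 4 respondents per population, 4 statements.
humans = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
# Repeated samples from one model, standing in for a simulated population.
simulated = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]

human_rates = agreement_rates(humans)
simulated_rates = agreement_rates(simulated)
r = pearson(human_rates, simulated_rates)

print(f"correlation of agreement rates: {r:.3f}")
```

A correlation near 1 would mean the simulated population agrees and disagrees on the same statements as the humans do; the paper reports that this correspondence is only modest for real LLMs.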

## Full-text entities

- **Diseases:** toxicity (MESH:D064420)
- **Chemicals:** silicon (MESH:D012825)
- **Species:** Homo sapiens (human, species) [taxon 9606], Felis catus (cat, species) [taxon 9685]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12947942/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12947942/full.md

## References

100 references — full list in the complete paper: https://tomesphere.com/paper/PMC12947942/full.md

---
Source: https://tomesphere.com/paper/PMC12947942