# How well do large language models mirror human cognition of word concepts?: A comparison of psychological ratings for early-acquired English words

**Authors:** Hiromichi Hagihara, Kazuki Miyazawa

PMC · DOI: 10.3758/s13428-025-02938-2 · Behavior Research Methods · 2026-02-02

## TL;DR

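This study compares ratings from four large language models (LLMs) with human norms on 21 psychological features for 695 early-acquired English words. LLM ratings align closely with human ratings for features such as Concreteness and Bodily Interactiveness but diverge notably for Iconicity and Arousal.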

## Contribution

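The study introduces a novel framework for evaluating the cognitive plausibility of LLMs using lexical psychological features, complementing existing benchmarks. It also identifies which features LLMs may estimate reliably and which require further refinement.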

## Key findings

- LLMs aligned strongly with human ratings for Concreteness and Bodily Interactiveness, both in rank correlations (rs > .82) and in distributional similarity.
- LLMs diverged notably from human ratings for Iconicity and Arousal (rank correlations, rs < .48).
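- Function words showed more pronounced discrepancies between human and LLM ratings than content words.
- Human- and LLM-derived features predicted words' age of acquisition (AoA) with both strong correspondences and systematic biases, depending on the model (differences in correlations ranged from −.27 to .28).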
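
The alignment figures above are rank correlations between human and LLM ratings of the same words. A minimal sketch of that computation follows, assuming Spearman's rank correlation with average ranks for ties; the function names and the five toy ratings are hypothetical and are not taken from the paper's data or code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical Concreteness ratings (illustrative values only).
human = {"ball": 4.9, "dog": 4.8, "run": 3.6, "happy": 2.4, "the": 1.2}
llm = {"ball": 4.7, "dog": 5.0, "run": 3.9, "happy": 2.0, "the": 1.5}

words = sorted(human)  # align both rating lists on the same word order
rho = spearman([human[w] for w in words], [llm[w] for w in words])
print(f"Spearman correlation: {rho:.2f}")
```

Working on ranks rather than raw scores makes the comparison insensitive to differences in how humans and a model use the rating scale, which is presumably why the summary reports rank correlations alongside separate distributional comparisons.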

## Abstract

This study examined how well large language models (LLMs) approximate human psychological ratings for early-acquired English words. We used four state-of-the-art LLMs, including GPT-4o and Meta-Llama-3.1, to evaluate 21 static psychological features for 695 words and compared these estimates with human norms. The results showed that LLMs aligned well with human ratings for some features (e.g., Concreteness, Bodily Interactiveness) in terms of rank correlations (rs > .82) and distributional similarities but diverged notably for others (e.g., Iconicity, Arousal; rs < .48). Compared with content words, function words showed more pronounced discrepancies between human and LLM ratings. We also assessed how similarly human- and LLM-derived psychological features predicted words’ age of acquisition (AoA), revealing both strong correspondences and systematic biases, depending on the model (differences in correlations ranged from −.27 to .28). Based on these analyses, we identified which features may be reliably estimated using LLMs, which require further refinement, and what methodological considerations are necessary for applying LLM-based measures in cognitive science. We discuss the implications of using LLMs as methodological tools in psychology and cognitive science, highlighting both their practical advantages (e.g., data coverage and data collection efficiency) and theoretical relevance. The present study provides a novel framework for evaluating the cognitive plausibility of LLMs by using lexical psychological features, complementing existing benchmarks.

The online version contains supplementary material available at 10.3758/s13428-025-02938-2.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12864368/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12864368/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12864368/full.md

---
Source: https://tomesphere.com/paper/PMC12864368