# Fine-grained evaluation of large language models in medicine using non-parametric cognitive diagnostic modeling

**Authors:** Tianpeng Zheng, Jiayi Liu, Shicong Feng, Zhehan Jiang

PMC · DOI: 10.1038/s41598-026-36627-7 · Scientific Reports · 2026-01-28

## TL;DR

This paper introduces a way to evaluate large language models in medicine that diagnoses which specific medical subdomains each model has mastered, rather than reporting a single average score.

## Contribution

The study applies a non-parametric cognitive diagnostic approach to 2,809 test-bank items to evaluate 41 LLMs across 22 medical subdomains.
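To make the idea of non-parametric cognitive diagnosis concrete, here is a minimal sketch of one common variant: classifying a respondent by the attribute profile whose ideal response pattern is closest, in Hamming distance, to the observed responses. This is a generic illustration, not the authors' confirmed procedure. The Q-matrix, the responses, the conjunctive scoring rule, and the function names `ideal_responses`, `hamming`, and `classify` are all assumptions made up for the example.

```python
from itertools import product

def ideal_responses(q_matrix, profile):
    """Conjunctive rule (assumed): an item is answered correctly
    only if every attribute it requires is mastered."""
    return [int(all(a >= q for q, a in zip(row, profile))) for row in q_matrix]

def hamming(u, v):
    """Number of positions where two response vectors disagree."""
    return sum(x != y for x, y in zip(u, v))

def classify(responses, q_matrix):
    """Return the attribute profile whose ideal responses are closest
    to the observed ones. Ties go to the first profile enumerated."""
    n_attrs = len(q_matrix[0])
    candidates = product((0, 1), repeat=n_attrs)
    return min(candidates,
               key=lambda p: hamming(responses, ideal_responses(q_matrix, p)))

# Hypothetical toy data: 7 items, 3 attributes (rows = items, columns = attributes).
Q = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
observed = [1, 1, 1, 1, 0, 0, 1]  # 1 = correct, 0 = incorrect

profile = classify(observed, Q)
print(profile)
```

In this toy case the respondent is classified as having mastered the first two attributes but not the third. Exhaustive enumeration over all profiles is only practical for a handful of attributes; with 22 subdomains a real analysis would need a more economical search or an item design that lets attributes be assessed separately.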

## Key findings

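- All evaluated LLMs show mastery in 15 fields, including Cardiology, Dermatology, and Endocrinology, while none master ECG & hypertension & lipids or Liver Disorders; Radiology (87.80%), Anesthetics & ITU, and Emergency Medicine (95.12% each) fall in between.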
- Model size does not always correlate with comprehensive medical knowledge.
- A psychometrically grounded approach provides a multidimensional evaluation of LLMs and identifies competency gaps that matter for clinical deployment.

## Abstract

With the rapid advancement of large language models (LLMs), efficiently and accurately evaluating their capabilities is essential for both developers and users. Unfortunately, most benchmarks evaluate the functionality of LLMs using average scores. This approach oversimplifies evaluation by overlooking nuanced performance differences across specific knowledge domains, failing to provide a comprehensive analysis of the models’ strengths and weaknesses. Safe clinical deployment of LLMs requires moving beyond simple accuracy scores to identify specific knowledge gaps. This study introduces an innovative interdisciplinary approach by integrating measurement theory and psychometric modeling into LLM research, bridging artificial intelligence with educational psychology. Based on 2,809 items from the test bank administered by the National Center for Health Professions Education Development, it employs a non-parametric cognitive diagnostic approach based on cognitive diagnostic assessment to evaluate the performance of 41 LLMs across 22 medical subdomains. The number of attributes mastered by the evaluated LLMs ranges from 17 to 20. Models with similar total scores can differ notably in their mastery of specific areas, highlighting strengths in some fields and gaps in others. Furthermore, factors such as model size do not always predict comprehensive medical knowledge. The LLMs demonstrate exceptional performance in several areas, achieving 100% mastery in 15 fields, including Cardiology, Dermatology, and Endocrinology, underscoring their strong medical knowledge. However, notable variations exist across certain domains. For instance, while Pharmacology and Neuroscience achieve high mastery proportions of 97.56%, Anesthetics & ITU and Emergency Medicine achieve lower proportions of 95.12%. Similarly, Radiology has a mastery proportion of 87.80%, while ECG & hypertension & lipids and Liver Disorders show 0%, revealing substantial gaps in these specialized fields.
This psychometrically grounded approach provides multidimensional evaluation of LLMs, identifying specific competency gaps critical for clinical deployment. This methodology serves as an essential quality assurance tool for hospitals, developers, and regulators, enabling domain-specific validation to mitigate risks and ensure patient safety before clinical implementation.
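The mastery proportions quoted in the abstract can be read as counts of models. The short sketch below converts each reported percentage back into the nearest whole number of models, assuming each proportion is computed over all 41 evaluated LLMs (an assumption; the abstract does not state the denominator explicitly). The names `N_MODELS`, `reported`, and `implied_count` are illustrative only.

```python
N_MODELS = 41  # number of LLMs evaluated, per the abstract

# Mastery proportions (%) as reported in the abstract.
reported = {
    "Cardiology": 100.0,
    "Pharmacology": 97.56,
    "Neuroscience": 97.56,
    "Anesthetics & ITU": 95.12,
    "Emergency Medicine": 95.12,
    "Radiology": 87.80,
    "Liver Disorders": 0.0,
}

def implied_count(pct, n=N_MODELS):
    """Nearest whole number of models consistent with a reported percentage
    (assumes the denominator is n)."""
    return round(pct / 100 * n)

counts = {field: implied_count(pct) for field, pct in reported.items()}

for field, k in counts.items():
    # Recompute the percentage from the count to check it matches the report.
    print(f"{field}: {k}/{N_MODELS} = {100 * k / N_MODELS:.2f}%")
```

Under that assumption, 97.56% corresponds to 40 of 41 models, 95.12% to 39, and 87.80% to 36, so the differences between these fields amount to one to five models failing to reach mastery.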

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12910072/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12910072/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12910072/full.md

---
Source: https://tomesphere.com/paper/PMC12910072