# Benchmark evaluation of large language models for clinical decision support in headache management

**Authors:** Shi Chen, Dong Liang, Xu Qiu, Chengqi Dong, Jiayi Deng, Li Xu, Xiaoxue Dong, Yonglei Zhao, Xuemei Fan, Xiaoyu Liu, Yali Wu, Jianliang Sun, Feifang He, Ke Ma, Liang Yu, Hanbin Wang

PMC · DOI: 10.22514/jofph.2026.029 · 2026-03-12

## TL;DR

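This study benchmarks seven large language models on 13 New England Journal of Medicine headache cases under two prompting strategies. Individual models led on diagnostic accuracy, supplementary value, or readability, but overall clinical accuracy remained below expert performance and is not sufficient for unsupervised clinical use.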

## Contribution

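The study introduces a structured, reproducible benchmark for evaluating LLMs in headache management. Three headache specialists scored each model's output on six clinical dimensions using a 5-point Likert rubric, and the analysis compares models, prompting strategies (ask-in-sequence vs. ask-at-once), and cases.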
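As a rough illustration of how rubric scores from such a benchmark could be organized, the sketch below stores one record per rater score and averages across raters. The `Rating` class, the `mean_scores` helper, and the example scores are hypothetical. The abstract does not say how the authors aggregated or tested their ratings, so simple averaging is an assumption here.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    """One rater's 5-point Likert score (hypothetical record layout)."""
    model: str      # e.g. "ChatGPT-4o"
    case: str       # e.g. "C8"
    strategy: str   # "AS" (ask-in-sequence) or "AO" (ask-at-once)
    dimension: str  # e.g. "diagnostic accuracy"
    rater: int      # rater index, 1..3
    score: int      # Likert score, 1..5


def mean_scores(ratings, keys=("model", "strategy", "dimension")):
    """Average scores over every rating that shares the given key fields.

    Averaging is an assumption for illustration, not the paper's stated method.
    """
    buckets = defaultdict(list)
    for r in ratings:
        if not 1 <= r.score <= 5:
            raise ValueError(f"score out of Likert range: {r.score}")
        buckets[tuple(getattr(r, k) for k in keys)].append(r.score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up scores, only to show the call shape.
example = [
    Rating("ChatGPT-4o", "C1", "AS", "diagnostic accuracy", 1, 4),
    Rating("ChatGPT-4o", "C1", "AS", "diagnostic accuracy", 2, 5),
    Rating("ChatGPT-4o", "C1", "AS", "diagnostic accuracy", 3, 3),
]
summary = mean_scores(example)
```

Changing `keys` to `("case",)` would give per-case averages, which is the view needed to spot cases that score low across all models.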

## Key findings

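- In the ask-in-sequence (AS) strategy, ChatGPT-4o outperformed Grok-3 on diagnostic accuracy. Strategy-related differences in diagnostic accuracy were observed only for Grok-3.
- Supplementary value was generally higher with AS. In AS, Grok-3 outperformed ChatGPT-5 and Hunyuan-T1; in ask-at-once (AO), DeepSeek-R1 outperformed ChatGPT-5.
- Readability differed significantly. Gemini 2.5 Pro had the highest Flesch Reading Ease (best readability) across strategies, while ChatGPT-4o had the highest Flesch-Kincaid Grade Level (worst readability) within AS.
- Cases C8 and C11 consistently received very low scores, and no significant model differences were found for the other four clinical dimensions.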

## Abstract

**Background:** Headache disorders are a major cause of disability worldwide. In routine practice, diagnosis and guideline-based management are difficult because symptoms can overlap between primary and secondary headaches, and clinicians must combine clinical, imaging, and pathological information. Large language models (LLMs) are being proposed to assist clinical reasoning, but their performance on headache cases and their sensitivity to prompting have not been systematically assessed.

**Methods:** We evaluated seven leading LLMs using 13 headache cases from the New England Journal of Medicine (NEJM). We compared two prompting strategies: ask-in-sequence (AS) and ask-at-once (AO). Using a 5-point Likert rubric, three headache specialists independently scored six dimensions: rationality of diagnostic thinking, comprehensiveness of differential diagnosis, diagnostic accuracy, completeness of pathological diagnosis, clinical management, and supplementary value. Readability was measured with Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). We analyzed differences across models, prompting strategies, and cases.

**Results:** Diagnostic accuracy differed by model: in the AS strategy, ChatGPT-4o outperformed Grok-3. Supplementary value also varied: in AS, Grok-3 outperformed ChatGPT-5 and Hunyuan-T1; in AO, DeepSeek-R1 outperformed ChatGPT-5. Overall, supplementary value was generally higher with AS, while strategy-related differences in diagnostic accuracy were observed only for Grok-3. Performance also depended on the case; C8 and C11 consistently received very low scores, suggesting difficulty integrating psychiatric or warning signs with pathological findings. Readability differed significantly: Gemini 2.5 Pro had the highest FRE (best readability) across strategies, and AS outputs generally had higher FRE. Within AS, ChatGPT-4o had the highest FKGL (worst readability). No significant model differences were found for the other four clinical dimensions.

**Conclusions:** This study provides a structured, reproducible evaluation of LLMs on headache case analysis. While some models improved supplementary value, diagnostic accuracy, or readability, overall clinical accuracy remains below expert performance and is not sufficient for unsupervised clinical use.
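For readers unfamiliar with the two readability metrics, the sketch below computes FRE and FKGL from the standard Flesch formulas. Higher FRE means easier text, and higher FKGL means a higher US school grade is needed to read it. The syllable counter is a crude vowel-group heuristic and the function names are illustrative. The tooling the authors actually used is not stated in the abstract, so values from this sketch may differ from theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, dropping a silent final 'e'.

    This is a heuristic assumption; dictionary-based counters are more accurate.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str):
    """Return (sentence count, word count, syllable count) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return len(sentences), len(words), sum(count_syllables(w) for w in words)


def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


simple = "The cat sat on the mat."
dense = "Comprehensive differential diagnosis necessitates multidisciplinary evaluation."
fre_simple = flesch_reading_ease(simple)
fkgl_simple = flesch_kincaid_grade(simple)
```

Both metrics depend only on sentence length and syllables per word, so a model can score well on readability while being clinically wrong. This is why the study reports readability separately from the six clinical dimensions.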

## Full-text entities

- **Diseases:** headache (MESH:D006261), primary headaches (MESH:D051270), LLMs (MESH:D007806), central nervous system inflammation (MESH:D007249), fever (MESH:D005334), red flag blindness (MESH:D003117), fatigue (MESH:D005221), psychiatric (MESH:D001523), intracranial hemorrhage (MESH:D020300), Headache Disorders (MESH:D020773), tension-type headache (MESH:D018781), primary and secondary headaches (MESH:D051271), migraine (MESH:D008881), AS (MESH:D010855), hallucinations (MESH:D006212), NEJM (MESH:D009134)
- **Chemicals:** AO (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13036623/full.md

---
Source: https://tomesphere.com/paper/PMC13036623