Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice

Marvin Kopka; Longqi He; Markus A. Feufel

PMC · DOI: 10.1038/s43856-026-01466-0 · February 25, 2026

Open Access

TL;DR

This study tests how well different versions of ChatGPT can give advice on when to seek medical care, finding that newer models aren't consistently better and accuracy remains insufficient for standalone use.

Contribution

The study systematically evaluates 22 ChatGPT models on 45 validated patient vignettes, assessing both the accuracy of their care-seeking advice and strategies for aggregating repeated recommendations.

Findings

01

The best-performing model (o1-mini) achieved 74% accuracy in care-seeking advice.

02

Newer models did not consistently outperform older ones but improved in identifying self-care cases.

03

Aggregation strategies improved accuracy by up to 4 percentage points.

Abstract

Artificial Intelligence tools such as ChatGPT are increasingly used by laypeople to support their care-seeking decisions, although the accuracy of newer models remains unclear. We aimed to evaluate the accuracy of care-seeking advice that is generated by all currently available ChatGPT models. We evaluated 22 ChatGPT models using 45 validated vignettes, each prompted ten times (9,900 total assessments). Each model classified the vignettes as requiring emergency care, non-emergency care, or self-care. We evaluated accuracy against each case’s gold standard solution (determined by two physicians), examined the variability across trials, and tested algorithms to aggregate multiple recommendations to improve accuracy. We show that o1-mini achieves the highest accuracy (74%), but we cannot observe an overall improvement with newer models – although reasoning models (e.g., o4-mini) improved…
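The design arithmetic and the idea of aggregating repeated recommendations can be illustrated with a minimal Python sketch. The abstract does not specify which aggregation algorithms were tested, so the majority vote below is an assumed stand-in, and its tie-break toward the more urgent level is likewise an assumption. The function names and the example trial data are hypothetical and are not study results.

```python
from collections import Counter

# Advice levels named in the abstract, ordered from least to most urgent.
LEVELS = ("self-care", "non-emergency care", "emergency care")


def total_assessments(n_models, n_vignettes, n_trials):
    """Number of model responses in a fully crossed design."""
    return n_models * n_vignettes * n_trials


def majority_vote(recommendations):
    """Return the most frequent level across repeated trials.

    Ties go to the more urgent level. This tie-break is an
    illustrative assumption, not taken from the paper.
    """
    if not recommendations:
        raise ValueError("need at least one recommendation")
    unknown = set(recommendations) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown levels: {sorted(unknown)}")
    counts = Counter(recommendations)
    top = max(counts.values())
    tied = [level for level in LEVELS if counts[level] == top]
    return tied[-1]


def accuracy(predictions, gold):
    """Share of predictions that match the gold standard solution."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# 22 models x 45 vignettes x 10 trials, as reported in the abstract.
print(total_assessments(22, 45, 10))  # prints 9900

# Hypothetical trials for one vignette (made-up data, not study results).
trials = ["self-care"] * 6 + ["non-emergency care"] * 4
print(majority_vote(trials))  # prints self-care
```

Aggregating ten trials into one recommendation per vignette, and then scoring those aggregated recommendations with `accuracy`, is one way such a strategy could be compared against single-trial accuracy.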

Figures: 5

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Digital Mental Health Interventions