Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice
Marvin Kopka, Longqi He, Markus A. Feufel

TL;DR
This study tests how well different versions of ChatGPT advise laypeople on when to seek medical care. It finds that newer models are not consistently better and that accuracy remains insufficient for standalone use.
Contribution
The study systematically evaluates 22 ChatGPT models on 45 validated patient vignettes, measuring the accuracy of their care-seeking advice and testing strategies for aggregating repeated recommendations.
Findings
The best-performing model (o1-mini) achieved 74% accuracy in care-seeking advice.
Newer models did not consistently outperform older ones but improved in identifying self-care cases.
Aggregation strategies improved accuracy by up to 4 percentage points.
Abstract
Artificial Intelligence tools such as ChatGPT are increasingly used by laypeople to support their care-seeking decisions, although the accuracy of newer models remains unclear. We aimed to evaluate the accuracy of care-seeking advice that is generated by all currently available ChatGPT models. We evaluated 22 ChatGPT models using 45 validated vignettes, each prompted ten times (9,900 total assessments). Each model classified the vignettes as requiring emergency care, non-emergency care, or self-care. We evaluated accuracy against each case’s gold standard solution (determined by two physicians), examined the variability across trials, and tested algorithms to aggregate multiple recommendations to improve accuracy. We show that o1-mini achieves the highest accuracy (74%), but we cannot observe an overall improvement with newer models – although reasoning models (e.g., o4-mini) improved…
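The evaluation described in the abstract can be sketched in a few lines: score each recommendation against the gold standard, then collapse the repeated trials of a vignette into a single recommendation and score that. This is a minimal sketch, not the authors' code. The abstract does not name the aggregation algorithms that were tested, so the majority vote and its tie-break toward the more urgent level are assumptions, and the vignette IDs and toy data are hypothetical.

```python
from collections import Counter

# The three advice levels named in the abstract, ordered from most to least urgent.
LEVELS = ("emergency", "non-emergency", "self-care")


def accuracy(predictions, gold):
    """Share of individual recommendations that match the gold standard."""
    hits = 0
    total = 0
    for case_id, trials in predictions.items():
        for advice in trials:
            hits += advice == gold[case_id]
            total += 1
    return hits / total


def majority_vote(trials):
    """Most frequent recommendation across trials.

    Assumption: ties are resolved toward the more urgent level.
    The source does not specify how (or whether) ties were handled.
    """
    counts = Counter(trials)
    top = max(counts.values())
    tied = [level for level in LEVELS if counts.get(level, 0) == top]
    return tied[0]


def aggregated_accuracy(predictions, gold):
    """Accuracy after reducing each vignette's trials to one recommendation."""
    hits = sum(
        majority_vote(trials) == gold[case_id]
        for case_id, trials in predictions.items()
    )
    return hits / len(predictions)


# Hypothetical toy data: two vignettes, three trials each.
gold = {"v1": "emergency", "v2": "self-care"}
predictions = {
    "v1": ["emergency", "emergency", "non-emergency"],
    "v2": ["non-emergency", "self-care", "self-care"],
}

# Study size stated in the abstract: 22 models x 45 vignettes x 10 trials.
n_assessments = 22 * 45 * 10

print(accuracy(predictions, gold))
print(aggregated_accuracy(predictions, gold))
print(n_assessments)
```

In the toy data, four of the six individual recommendations are correct, while both majority votes are correct. This illustrates how aggregating repeated trials can raise accuracy when errors are scattered rather than systematic.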
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Digital Mental Health Interventions
