# Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice

**Authors:** Marvin Kopka, Longqi He, Markus A. Feufel

PMC · DOI: 10.1038/s43856-026-01466-0 · 2026-02-25

## TL;DR

This study tests how well different versions of ChatGPT can give advice on when to seek medical care, finding that newer models aren't consistently better and accuracy remains insufficient for standalone use.

## Contribution

The study introduces a systematic evaluation of 22 ChatGPT models using validated patient scenarios to assess care-seeking advice accuracy and aggregation strategies.

## Key findings

- The best-performing model (o1-mini) achieved 74% accuracy in care-seeking advice.
- Newer models did not consistently outperform older ones but improved in identifying self-care cases.
- Aggregation strategies improved accuracy by up to 4 percentage points.

## Abstract

Artificial Intelligence tools such as ChatGPT are increasingly used by laypeople to support their care-seeking decisions, although the accuracy of newer models remains unclear. We aimed to evaluate the accuracy of care-seeking advice that is generated by all currently available ChatGPT models.

We evaluated 22 ChatGPT models using 45 validated vignettes, each prompted ten times (9,900 total assessments). Each model classified the vignettes as requiring emergency care, non-emergency care, or self-care. We evaluated accuracy against each case’s gold standard solution (determined by two physicians), examined the variability across trials, and tested algorithms to aggregate multiple recommendations to improve accuracy.
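The evaluation design described above can be illustrated with a minimal sketch. It reproduces the assessment count and scores predictions against a gold standard. The function name `score`, the level labels, and the toy data are illustrative assumptions, not the authors' code. The overtriage and undertriage definitions follow common usage (advice more or less urgent than the gold standard) and are not spelled out in this summary.

```python
# Urgency levels named in the abstract, ordered from least to most urgent.
LEVELS = ["self-care", "non-emergency care", "emergency care"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Study design: 22 models x 45 vignettes x 10 trials each.
N_MODELS, N_VIGNETTES, N_TRIALS = 22, 45, 10
TOTAL_ASSESSMENTS = N_MODELS * N_VIGNETTES * N_TRIALS  # 9,900


def score(predictions, gold):
    """Return accuracy, overtriage rate, and undertriage rate.

    predictions and gold are equal-length lists of level names.
    Overtriage: advice more urgent than the gold standard (assumed definition).
    Undertriage: advice less urgent than the gold standard (assumed definition).
    """
    if not gold or len(predictions) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    n = len(gold)
    correct = sum(RANK[p] == RANK[g] for p, g in zip(predictions, gold))
    over = sum(RANK[p] > RANK[g] for p, g in zip(predictions, gold))
    under = n - correct - over
    return {"accuracy": correct / n, "overtriage": over / n, "undertriage": under / n}


# Made-up toy data, not taken from the study.
gold = ["self-care", "non-emergency care", "emergency care", "self-care"]
pred = ["non-emergency care", "non-emergency care", "emergency care", "self-care"]
result = score(pred, gold)
```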

We show that o1-mini achieves the highest accuracy (74%), but we observe no overall improvement with newer models, although reasoning models (e.g., o4-mini) are more accurate in identifying self-care cases. Selecting the lowest urgency level across multiple trials improves accuracy by 4 percentage points.
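The lowest-urgency aggregation rule mentioned above could look roughly like the following sketch. The name `lowest_urgency`, the level labels, and the example trial outputs are hypothetical, and the paper's actual aggregation algorithms may differ in detail.

```python
# Assumed ordering of the three advice levels, lowest urgency first.
URGENCY = {"self-care": 0, "non-emergency care": 1, "emergency care": 2}


def lowest_urgency(trials):
    """Aggregate repeated answers for one vignette by taking the least urgent one."""
    if not trials:
        raise ValueError("need at least one trial")
    return min(trials, key=lambda level: URGENCY[level])


# Made-up outputs for ten hypothetical trials on one vignette.
trials = ["non-emergency care"] * 7 + ["self-care"] * 2 + ["emergency care"]
aggregated = lowest_urgency(trials)
```

Here `aggregated` is `"self-care"`, because a single low-urgency answer is enough to determine the result. Since the models tend to advise more urgent care than needed, a rule that leans toward lower urgency plausibly offsets that tendency, at the cost of a possible increase in under-urgent advice.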

Although newer models increasingly provide self-care advice, their accuracy remains insufficient for standalone use. However, making use of output variability with aggregation algorithms can improve the performance of existing models.

Many people use ChatGPT to decide whether they should go to the emergency room, see a doctor soon, or manage symptoms at home. We tested 22 model versions of ChatGPT with 45 real patient stories. Each story was asked ten times to test model consistency in suggesting one of three options: emergency care, non-emergency care, or self-care. The accuracy varied between all models; the best model gave correct recommendations in 74% of all cases. All models tended to advise more urgent care than needed, and they struggled most with self-care cases. Newer models were not generally better than older ones, although they gave correct self-care recommendations more often. When we combined a model’s consecutive answers to the same query, accuracy improved slightly. ChatGPT may help people recognize emergencies, but it is not reliable enough to guide care on its own. Further safety testing is needed.

Kopka et al. evaluate 22 ChatGPT models on care-seeking advice using 45 validated patient vignettes, each asked ten times. Accuracy is moderate (the best model correctly classifies 74% of all cases), and all models overtriage and struggle to identify self-care cases, although aggregation strategies can improve accuracy.

## Full-text entities

- **Diseases:** obsessive-compulsive behavior (MESH:D009771), anxiety (MESH:D001007)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13031804/full.md

---
Source: https://tomesphere.com/paper/PMC13031804