# Comparative Analysis of Large Language Models in First-Aid Scenario Recognition and Management: An In Silico Evaluation of ChatGPT and Claude

**Authors:** Norvin K West, Ajani J Edwards, Jessica K Sims, Jordan E O'Brien, Jeffrey S Upperman

PMC · DOI: 10.7759/cureus.94229 · Cureus · 2025-10-09

## TL;DR

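This study compares GPT-4o (ChatGPT) and Claude 3.5 Sonnet on first-aid guidance across five standardized vignettes. Both models identified the emergencies and triaged them correctly, but Claude 3.5 gave more accurate, complete, and consistent first-aid advice.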

## Contribution

The contribution is an in silico evaluation of two LLMs on first-aid scenarios, using five standardized vignettes presented three times each and a 0-2 scoring framework applied across six domains.

## Key findings

- Claude 3.5 outperformed GPT-4o in first-aid accuracy (2.0 vs 1.5), comprehensiveness (1.5 vs 1.3), and consistency (2.0 vs 1.6).
- Both models achieved perfect diagnostic and triage scores (2.0) across all scenarios.
- GPT-4o omitted critical steps, including naloxone administration for opioid overdose and immediate sheltering guidance after a lightning strike.

## Abstract

Introduction: Large language models (LLMs) deliver real-time, conversational guidance, yet their reliability for time-critical first aid remains unclear.

Materials and methods: Five standardized vignettes (drowning, animal bite, opioid overdose, lightning strike, and frostbite) were presented three times each to GPT-4o (OpenAI, San Francisco, CA, USA) and Claude 3.5 Sonnet (Anthropic, San Francisco, CA, USA). Outputs were scored (0 = incorrect/unsafe, 1 = incomplete, 2 = entirely correct) across six domains: diagnostic accuracy, first-aid advice, triage accuracy, comprehensiveness, safety, and consistency. Scores were averaged within and across vignettes.
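The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function name `average_scores`, the helper `make_run`, the domain keys, and the example values are all hypothetical placeholders, and the sketch assumes every domain (including consistency) receives one 0-2 score per run, which the abstract does not specify.

```python
from statistics import mean

# Six scoring domains named in the abstract (key spellings are our own).
DOMAINS = (
    "diagnostic_accuracy",
    "first_aid_advice",
    "triage_accuracy",
    "comprehensiveness",
    "safety",
    "consistency",
)


def make_run(*values):
    """Build one run's score sheet from six values given in DOMAINS order."""
    if len(values) != len(DOMAINS):
        raise ValueError("expected one score per domain")
    return dict(zip(DOMAINS, values))


def average_scores(scores):
    """Average 0-2 scores within each vignette, then across vignettes.

    scores maps a vignette name to a list of per-run dicts (domain -> score).
    Returns (per_vignette, overall), both keyed by domain.
    """
    per_vignette = {}
    for vignette, runs in scores.items():
        for run in runs:
            for domain in DOMAINS:
                if run[domain] not in (0, 1, 2):
                    raise ValueError(f"invalid score for {domain} in {vignette}")
        # Within-vignette average over the repeated runs.
        per_vignette[vignette] = {
            d: mean(run[d] for run in runs) for d in DOMAINS
        }
    # Across-vignette average of the per-vignette means.
    overall = {d: mean(v[d] for v in per_vignette.values()) for d in DOMAINS}
    return per_vignette, overall


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT the study's data.
    example = {
        "vignette_a": [
            make_run(2, 2, 2, 1, 2, 2),
            make_run(2, 1, 2, 1, 2, 2),
            make_run(2, 2, 2, 2, 2, 2),
        ],
        "vignette_b": [make_run(2, 2, 2, 2, 2, 2)] * 3,
    }
    per_vignette, overall = average_scores(example)
    for domain in DOMAINS:
        print(f"{domain}: {overall[domain]:.2f}")
```

Averaging within each vignette first and then across vignettes weights every scenario equally, which matches a flat average only when each vignette has the same number of runs, as it does in the study design (three runs each).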

Results: Both LLMs achieved perfect diagnostic (2.0) and triage (2.0) scores. Claude 3.5 outperformed GPT-4o in first-aid accuracy (2.0 vs 1.5), comprehensiveness (1.5 vs 1.3), and consistency (2.0 vs 1.6). Safety ratings were comparable (1.9-2.0). Key GPT-4o omissions included naloxone administration for opioid overdose and immediate sheltering guidance after a lightning strike.

Conclusions: Claude 3.5 provided more complete and stable first-aid guidance than GPT-4o, although both models reliably identified emergencies and advised on the appropriate escalation of care. Wider implementation warrants larger vignette sets, real-user simulations, and continuous monitoring for guideline concordance.

## Linked entities

- **Chemicals:** naloxone (PubChem CID 4425)

## Full-text entities

- **Diseases:** opioid overdose (MESH:D000083682)
- **Chemicals:** naloxone (MESH:D009270)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12597125/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/PMC12597125/full.md

---
Source: https://tomesphere.com/paper/PMC12597125