Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
David Fraile Navarro, Farah Magrabi, Enrico Coiera

TL;DR
This study shows that the way consumer health AI is evaluated, especially the use of forced-choice formats, significantly affects perceived triage accuracy, highlighting the need for more realistic testing conditions.
Contribution
The paper demonstrates that evaluation format biases triage performance metrics and advocates for testing AI in naturalistic, user-like interactions to better assess real-world safety.
Findings
Naturalistic interactions improve triage accuracy by 6.4 percentage points relative to the constrained, exam-style condition.
Forced-choice evaluation underestimates true triage performance.
Evaluation format critically influences perceived AI safety and effectiveness.
Abstract
Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points. Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and…
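The trial counts in the abstract can be sanity-checked against the stated five models and 17 scenarios. The sketch below is illustrative only: it assumes a fully balanced design (every model-scenario pair run the same number of times in each condition), which the abstract does not state. The function name `runs_per_cell` is hypothetical, and the per-pair figure may in practice combine prompt variants and repeated runs.

```python
# Figures taken from the abstract.
N_MODELS = 5
N_SCENARIOS = 17
TRIALS = {"constrained": 1275, "naturalistic": 850}


def runs_per_cell(total_trials, n_models=N_MODELS, n_scenarios=N_SCENARIOS):
    """Trials per model-scenario pair, ASSUMING a fully balanced design."""
    cells = n_models * n_scenarios  # 5 * 17 = 85 model-scenario pairs
    if total_trials % cells != 0:
        raise ValueError("trial count does not divide evenly; design is not balanced")
    return total_trials // cells


for condition, total in TRIALS.items():
    print(condition, runs_per_cell(total))
```

Under that assumption, the counts imply 15 trials per model-scenario pair in the constrained condition and 10 in the naturalistic condition.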
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Digital Mental Health Interventions · Misinformation and Its Impacts
