Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

David Fraile Navarro; Farah Magrabi; Enrico Coiera

arXiv:2603.11413 · cs.HC · March 27, 2026

Open Access

TL;DR

This study shows that the way consumer health AI is evaluated, especially the use of forced-choice formats, significantly affects perceived triage accuracy, highlighting the need for more realistic testing conditions.

Contribution

The paper demonstrates that evaluation format biases triage performance metrics and advocates for testing AI in naturalistic, user-like interactions to better assess real-world safety.

Findings

01. Naturalistic interaction improved triage accuracy by 6.4 percentage points over the constrained, exam-style condition ($p = 0.015$).

02. Forced-choice evaluation underestimates true triage performance.

03. Evaluation format critically influences perceived AI safety and effectiveness.

Abstract

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and…
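The trial counts quoted in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes a balanced design in which every model sees every scenario the same number of times, which the abstract does not state, and the function name `implied_repeats` is hypothetical rather than taken from the paper.

```python
# Counts as stated in the abstract.
N_MODELS = 5
N_SCENARIOS = 17
CONSTRAINED_TRIALS = 1275
NATURALISTIC_TRIALS = 850


def implied_repeats(total_trials, n_models=N_MODELS, n_scenarios=N_SCENARIOS):
    """Repeats per (model, scenario) cell, ASSUMING a balanced design.

    The balanced-design assumption is ours, not the paper's; the function
    raises if the total does not divide evenly across the cells.
    """
    cells = n_models * n_scenarios  # 5 * 17 = 85 model-scenario cells
    repeats, remainder = divmod(total_trials, cells)
    if remainder:
        raise ValueError("trial count is not a multiple of the number of cells")
    return repeats


if __name__ == "__main__":
    print("constrained repeats per cell:", implied_repeats(CONSTRAINED_TRIALS))
    print("naturalistic repeats per cell:", implied_repeats(NATURALISTIC_TRIALS))
```

Under that assumption, both totals divide evenly across the 85 model-scenario cells, which is consistent with a fixed number of repeated runs per cell in each condition, though the full paper should be consulted for the actual design.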

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Digital Mental Health Interventions · Misinformation and Its Impacts