A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI
Deep Bhatt, Surya Ayyagari, Anuruddh Mishra

TL;DR
This paper presents a scalable benchmarking framework for evaluating health AI chatbots' diagnostic accuracy using clinical vignettes, demonstrating its application with an AI chatbot that outperforms traditional symptom checkers.
Contribution
The study introduces a reproducible, scalable benchmarking methodology for health AI systems and applies it to evaluate an AI chatbot's diagnostic performance across multiple specialties.
Findings
August, the AI chatbot evaluated, achieved 81.8% top-one diagnostic accuracy (327/400 cases)
August required 47% fewer questions than traditional symptom checkers
August demonstrated 95.8% accuracy in specialist referrals
Abstract
Diagnostic errors in healthcare persist as a critical challenge, with increasing numbers of patients turning to online resources for health information. While AI-powered healthcare chatbots show promise, there exists no standardized and scalable framework for evaluating their diagnostic capabilities. This study introduces a scalable benchmarking methodology for assessing health AI systems and demonstrates its application through August, an AI-driven conversational chatbot. Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions. In systematic testing, August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers. The system demonstrated 95.8% accuracy in specialist…
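The top-one and top-two figures in the abstract are top-k accuracies: the share of vignettes whose reference diagnosis appears among the first k entries of the system's ranked differential (327/400 = 0.8175, reported as 81.8%; 340/400 = 0.85). The sketch below shows one way such a metric could be computed. The function name, the toy cases, and the case-insensitive exact string match are all assumptions made for illustration; the abstract does not say how the study matched diagnoses.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_differentials: Sequence[Sequence[str]],
    reference_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is in the top-k predictions.

    Assumption: a prediction counts as correct on a case-insensitive exact
    string match. A real evaluation would likely need clinical judgment or
    synonym handling instead.
    """
    if len(ranked_differentials) != len(reference_diagnoses):
        raise ValueError("each case needs exactly one reference diagnosis")
    if not reference_diagnoses:
        raise ValueError("at least one case is required")
    if k < 1:
        raise ValueError("k must be a positive integer")

    hits = 0
    for predictions, reference in zip(ranked_differentials, reference_diagnoses):
        # Only the first k entries of the ranked differential are considered.
        top_k = {p.strip().lower() for p in predictions[:k]}
        if reference.strip().lower() in top_k:
            hits += 1
    return hits / len(reference_diagnoses)


# Hypothetical toy data (not from the paper): four vignettes, each with a
# ranked differential and a single reference diagnosis.
ranked = [
    ["Migraine", "Tension headache"],
    ["GERD", "Angina"],
    ["Asthma", "COPD"],
    ["Influenza", "COVID-19"],
]
truth = ["migraine", "angina", "asthma", "pneumonia"]

top1 = top_k_accuracy(ranked, truth, k=1)  # 2 of 4 cases match at rank 1
top2 = top_k_accuracy(ranked, truth, k=2)  # 3 of 4 cases match within rank 2
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

Top-two accuracy can never be lower than top-one accuracy on the same cases, since every top-one hit is also a top-two hit. This matches the ordering of the reported figures (340 versus 327 of 400).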
Taxonomy
Topics: Artificial Intelligence in Healthcare · Artificial Intelligence in Healthcare and Education · Quality and Safety in Healthcare
