Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Krithik Vishwanath; Mrigayu Ghosh; Anton Alyakin; Daniel Alexander Alber; Yindalon Aphinyanaphongs; Eric Karl Oermann

arXiv:2512.01191 · cs.CL · December 2, 2025

Open Access

TL;DR

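This study compares two clinical AI assistants (OpenEvidence and UpToDate Expert AI) with three generalist large language models (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) on a 1,000-item medical benchmark, finding that the generalist models outperform the specialized clinical tools in accuracy and communication quality.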

Contribution

The study provides the first independent, quantitative evaluation showing that generalist LLMs surpass clinical AI tools on medical benchmarks, highlighting the need for rigorous, independent assessment of clinical AI tools.

Findings

01  GPT-5 achieved the highest scores of all systems evaluated.

02  The clinical tools (OpenEvidence and UpToDate Expert AI) showed deficits in completeness and safety reasoning.

03  Generalist LLMs consistently outperformed the clinical AI systems on the benchmark.

Abstract

Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context…
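
The abstract names two task types, MedQA (multiple-choice medical knowledge) and HealthBench (rubric-graded clinician alignment), but does not spell out how each is scored. The following is a minimal sketch of how such tasks are commonly scored, not the authors' pipeline. It assumes multiple-choice items are scored by exact-match accuracy and rubric items by earned points over the maximum positive points, with the mean clipped to [0, 1]. All class and function names are hypothetical, and the values in the usage lines are toy data, not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    """A MedQA-style item: the answer key and the model's chosen option."""
    correct: str
    predicted: str


@dataclass
class RubricCriterion:
    """A HealthBench-style criterion (assumed form); points may be negative."""
    points: int
    met: bool


def multiple_choice_accuracy(items: List[MultipleChoiceItem]) -> float:
    """Fraction of items where the predicted option equals the key."""
    if not items:
        raise ValueError("need at least one item")
    return sum(i.predicted == i.correct for i in items) / len(items)


def rubric_score(criteria: List[RubricCriterion]) -> float:
    """Points earned on met criteria over the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def mean_rubric_score(responses: List[List[RubricCriterion]]) -> float:
    """Mean per-response rubric score, clipped to the range [0, 1]."""
    if not responses:
        raise ValueError("need at least one response")
    mean = sum(rubric_score(r) for r in responses) / len(responses)
    return min(1.0, max(0.0, mean))


# Toy data for illustration only.
toy_mc = [
    MultipleChoiceItem("A", "A"),
    MultipleChoiceItem("B", "C"),
    MultipleChoiceItem("D", "D"),
    MultipleChoiceItem("A", "A"),
]
toy_rubric = [
    RubricCriterion(points=5, met=True),
    RubricCriterion(points=3, met=False),
    RubricCriterion(points=-2, met=True),  # penalty for an unsafe statement
]
print(multiple_choice_accuracy(toy_mc))
print(rubric_score(toy_rubric))
```

Keeping the two scores separate, rather than merging them into one number, reflects that they measure different things: factual recall on one side and the completeness, safety, and communication quality of free-text answers on the other.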

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Clinical Reasoning and Diagnostic Skills