MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences
Eric Wu, Kevin Wu, Jason Hom, Paul H. Yi, Angela Zhang, Alejandro Lozano, Jeff Nirschl, Jeff Tangney, Kevin Byram, Braydon Dymm, Narender Annapureddy, Eric Topol, David Ouyang, and James Zou

TL;DR
MedArena is an interactive platform that evaluates medical large language models based on clinician preferences using real-world queries, revealing model rankings and insights into what clinicians value beyond factual accuracy.
Contribution
This work introduces MedArena, a novel interactive evaluation platform that captures clinician preferences on medical LLM responses using real clinical questions, addressing limitations of static benchmarks.
Findings
Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o ranked highest among 12 models.
Clinicians prioritize response depth, clarity, and clinical nuance over factual recall.
Model rankings are stable even after controlling for style and formatting factors.
Abstract
Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models…
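The pairwise protocol described in the abstract (two randomly selected models, one preferred response) can be sketched in a few lines. The excerpt does not state how votes are turned into a ranking, so the Bradley-Terry fit below is an assumption borrowed from common arena-style evaluations, not necessarily the authors' method. The function names `sample_pair` and `bradley_terry`, and the vote format of `(winner, loser)` tuples, are hypothetical.

```python
import random


def sample_pair(models, rng=random):
    """Pick two distinct models uniformly at random for one comparison."""
    return tuple(rng.sample(models, 2))


def bradley_terry(votes, models, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Assumption: this is one standard way to rank from pairwise
    preferences; the source excerpt does not name its ranking method.
    Uses the classic minorization-maximization (MM) update.
    """
    wins = {m: 0.0 for m in models}
    games = {}  # symmetric count of comparisons per ordered pair
    for winner, loser in votes:
        wins[winner] += 1
        games[(winner, loser)] = games.get((winner, loser), 0) + 1
        games[(loser, winner)] = games.get((loser, winner), 0) + 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # Only pairs that actually met contribute to the denominator.
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in models
                if j != i and games.get((i, j), 0) > 0
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so strengths average to 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}
    return strength


def rank(strength):
    """Return model names sorted from strongest to weakest."""
    return sorted(strength, key=strength.get, reverse=True)
```

As a sanity check on the sketch: if one model is preferred three times and its opponent once, the fitted strengths come out in a 3:1 ratio, which matches the observed win odds. A raw win rate would give the same ordering for that case; a model-based fit matters mainly when models face different sets of opponents.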
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling
