MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Eric Wu, Kevin Wu, Jason Hom, Paul H. Yi, Angela Zhang, Alejandro Lozano, Jeff Nirschl, Jeff Tangney, Kevin Byram, Braydon Dymm, Narender Annapureddy, Eric Topol, David Ouyang, and James Zou

arXiv:2603.15677 · cs.CL · March 18, 2026

Open Access

TL;DR

MedArena is an interactive platform that evaluates medical large language models based on clinician preferences using real-world queries, revealing model rankings and insights into what clinicians value beyond factual accuracy.

Contribution

This work introduces MedArena, a novel interactive evaluation platform that captures clinician preferences on medical LLM responses using real clinical questions, addressing limitations of static benchmarks.

Findings

01

Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o ranked highest among 12 models.

02

Clinicians prioritize response depth, clarity, and clinical nuance over factual recall.

03

Model rankings are stable even after controlling for style and formatting factors.

Abstract

Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models…
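The pairwise protocol described in the abstract (two randomly selected models, one clinician vote per query) can be aggregated into a ranking in several ways. The abstract does not state which aggregation MedArena uses, so the sketch below assumes a Bradley-Terry model fitted with simple minorization-maximization updates. The function names `sample_pair` and `fit_bradley_terry`, the model labels, and the toy votes are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample_pair(models, rng):
    """Pick two distinct models uniformly at random for one comparison."""
    first, second = rng.sample(models, 2)
    return first, second


def fit_bradley_terry(votes, models, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) votes.

    Assumption: this is an illustrative aggregation method, not
    necessarily the one used by MedArena.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            # Models with no comparisons keep their current strength;
            # the floor avoids a zero strength for models with no wins.
            updated[i] = max(wins[i] / denom, 1e-9) if denom > 0 else strength[i]
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}
    return strength


if __name__ == "__main__":
    rng = random.Random(0)
    models = ["model_a", "model_b", "model_c"]  # hypothetical labels
    print("next comparison:", sample_pair(models, rng))

    # Toy votes as (preferred, not preferred); invented for illustration.
    votes = (
        [("model_a", "model_b")] * 8 + [("model_b", "model_a")] * 2
        + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
        + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
    )
    scores = fit_bradley_terry(votes, models)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```

A strength-based model like this uses who was compared against whom, not just raw win counts, which matters when random pairing leaves some models with fewer or easier matchups than others.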

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling