Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

Richard Armitage

arXiv:2506.02987·cs.CL·June 4, 2025

Open Access

TL;DR

Leading large language models as of May 2025 demonstrate exceptional performance on primary care exam questions, surpassing average GP scores and highlighting their potential to support clinical practice and education.

Contribution

This study evaluates the performance of the latest leading LLMs on primary care exam questions, revealing their high accuracy and potential for supporting medical education and clinical decision-making.

Findings

01

All models significantly outperformed average GPs and registrars.

02

o3 achieved the highest score at 99%.

03

Models performed well across textual, laboratory, and image-based questions.

Abstract

Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than GPT-4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok 3, and Gemini 2.5 Pro) in primary care education, specifically in answering Membership of the Royal College of General Practitioners (MRCGP) style examination questions.

Methods: o3, Claude Opus 4, Grok 3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple-choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model…

Taxonomy

Topics: Interpreting and Communication in Healthcare

Methods: Greedy Policy Search