Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis
Richard Armitage

TL;DR
Leading large language models as of May 2025 demonstrate exceptional performance on primary care exam questions, surpassing average GP scores and highlighting their potential to support clinical practice and education.
Contribution
This study evaluates the performance of the latest leading LLMs on primary care exam questions, revealing their high accuracy and potential for supporting medical education and clinical decision-making.
Findings
All models significantly outperformed average GPs and registrars.
o3 achieved the highest score at 99%.
Models performed well across textual, laboratory, and image-based questions.
Abstract
Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than ChatGPT-4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Membership of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked with answering 100 randomly chosen multiple-choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model…
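As a rough illustration of the scoring the Methods describe, the sketch below marks a set of multiple-choice answers against an answer key and attaches a Wilson score interval to the resulting accuracy. It is a minimal sketch under stated assumptions, not the paper's analysis: the abstract is truncated before it describes any statistical method, so the interval is an added illustration, and the function names, the answer key, and the answers are all hypothetical placeholders chosen to mirror a 99% score on 100 questions.

```python
import math


def accuracy(answers, key):
    """Return (number correct, fraction correct) for answers marked against key."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    correct = sum(a == k for a, k in zip(answers, key))
    return correct, correct / len(key)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical data, not taken from the paper: 100 questions, one wrong answer.
key = ["A"] * 100
answers = ["A"] * 99 + ["B"]

correct, acc = accuracy(answers, key)
low, high = wilson_interval(correct, len(key))
print(f"{correct}/{len(key)} correct, accuracy {acc:.2f}, "
      f"approx. 95% interval {low:.3f} to {high:.3f}")
```

With only 100 questions, an interval of this kind is a reminder that small differences between models near the top of the scale may not be distinguishable.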
Taxonomy
Topics: Interpreting and Communication in Healthcare
Methods: Greedy Policy Search
