Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study

Dania El Natour; Mohamad Abou Alfa; Ahmad Chaaban; Reda Assi; Toufic Dally; Bahaa Bou Dargham

PMC · DOI: 10.2196/76928 · March 9, 2026

Open Access

TL;DR

This study compared five AI models on USMLE Step 1 questions, finding Grok had the highest accuracy and consistency, especially with images and complex questions.

Contribution

The study provides a comparative evaluation of newer AI models on USMLE Step 1 questions, highlighting their strengths in different medical domains and question formats.

Findings

01

Grok achieved the highest score (91.6%) and excelled in image-based and case-based questions.

02

DeepSeek scored lowest initially but matched others on text-only questions.

03

Models performed best in biostatistics and worst in musculoskeletal topics.

Abstract

Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on United States Medical Licensing Examination (USMLE)–style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats. This study aimed to evaluate and compare the performance of 5 publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 free 120-question set, assessing their accuracy and consistency across question types and medical subjects. This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding 1 audio-based item) was presented to each AI model by using a…
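
As a rough illustration of how a percentage score on this question set relates to item counts, the sketch below computes accuracy over the 119 scored items (120 minus the 1 excluded audio item). The function name and the count of 109 correct answers are hypothetical: the excerpt does not report raw counts or say whether scores were averaged over repeated attempts, so 109 is simply a value that would round to the 91.6% quoted in the findings if the score were a plain proportion.

```python
def accuracy_pct(num_correct: int, num_scored: int) -> float:
    """Return accuracy as a percentage of scored items."""
    if num_scored <= 0:
        raise ValueError("num_scored must be positive")
    if not 0 <= num_correct <= num_scored:
        raise ValueError("num_correct must be between 0 and num_scored")
    return 100.0 * num_correct / num_scored


# From the abstract: a 120-question set with 1 audio-based item excluded.
TOTAL_ITEMS = 120
EXCLUDED_AUDIO_ITEMS = 1
SCORED_ITEMS = TOTAL_ITEMS - EXCLUDED_AUDIO_ITEMS

# Hypothetical raw count, NOT reported in the source; chosen only because
# it rounds to the 91.6% figure if the score is a simple proportion.
hypothetical_correct = 109

score = accuracy_pct(hypothetical_correct, SCORED_ITEMS)
print(f"{hypothetical_correct}/{SCORED_ITEMS} correct = {score:.1f}%")
```

With 119 scored items, each question is worth about 0.84 percentage points, so small gaps between models can come down to a handful of items.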

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Intelligent Tutoring Systems and Adaptive Learning