# GPT-4o and OpenAI o1 Performance on the 2024 Spanish Competitive Medical Specialty Access Examination: Cross-Sectional Quantitative Evaluation Study

**Authors:** Pau Benito, Mikel Isla-Jover, Pablo González-Castro, Pedro José Fernández Esparcia, Manuel Carpio, Iván Blay-Simón, Pablo Gutiérrez-Bedia, Maria J Lapastora, Beatriz Carratalá, Carlos Carazo-Casas

PMC · DOI: 10.2196/75452 · JMIR Medical Education · 2026-01-12

## TL;DR

This study evaluated GPT-4o and OpenAI o1 on the MIR 2024 examination, the Spanish national test that determines eligibility for medical specialist training positions. Both models outperformed the average candidate and reached at least 80% accuracy in most medical subjects.

## Contribution

The study is the first to evaluate GPT-4o and OpenAI o1 on the MIR 2024 exam, benchmarking their accuracy against a consensus of medical specialists and the average MIR candidate.

## Key findings

- GPT-4o and OpenAI o1 achieved 89.8% and 92.6% accuracy, respectively, outperforming the average MIR candidate (56.6%).
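- Both models showed high response consistency under "Are you sure?" verification prompting and scored slightly higher on clinical cases than on theoretical questions, and on positive questions than on negative ones.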
- LLMs' accuracy declined with increasing question difficulty but remained above 80% in most medical subjects.

## Abstract

In recent years, generative artificial intelligence and large language models (LLMs) have rapidly advanced, offering significant potential to transform medical education. Several studies have evaluated the performance of chatbots on multiple-choice medical examinations.

The study aims to assess the performance of two LLMs, GPT-4o and OpenAI o1, on the Médico Interno Residente (MIR) 2024 examination, the Spanish national medical test that determines eligibility for competitive medical specialist training positions.

A total of 176 questions from the MIR 2024 examination were analyzed. Each question was presented individually to the chatbots to ensure independence and prevent memory retention bias. To minimize potential bias, no additional prompts were introduced. For each LLM, response consistency under verification prompting was assessed by systematically asking, “Are you sure?” after each response. Accuracy was defined as the percentage of responses that matched the official answers provided by the Spanish Ministry of Health. It was assessed for GPT-4o, OpenAI o1, and, as benchmarks, for a consensus of medical specialists and for the average MIR candidate. Subanalyses covered performance across medical subjects, question difficulty (quintiles based on the percentage of examinees correctly answering each question), and question types (clinical cases vs theoretical questions; positive vs negative questions).
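The measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the `Item` record, its field names, and the convention that quintile 1 holds the easiest questions are all assumptions introduced here, and consistency is computed as simple percent agreement, which may differ from the agreement statistic used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One exam question (hypothetical record layout, for illustration only)."""
    official: str                 # official answer key
    initial: str                  # model's first answer
    verified: str                 # model's answer after "Are you sure?"
    pct_examinees_correct: float  # share of examinees answering correctly


def accuracy(items, field):
    """Percentage of answers in `field` that match the official key."""
    correct = sum(1 for it in items if getattr(it, field) == it.official)
    return 100.0 * correct / len(items)


def consistency(items):
    """Percentage of questions where the answer did not change after verification."""
    same = sum(1 for it in items if it.initial == it.verified)
    return 100.0 * same / len(items)


def difficulty_quintiles(items):
    """Assign each item a quintile from 1 (easiest) to 5 (hardest).

    Items are ranked by the share of examinees who answered correctly;
    the numbering direction is an assumption of this sketch.
    """
    n = len(items)
    order = sorted(range(n), key=lambda i: items[i].pct_examinees_correct, reverse=True)
    quintile = [0] * n
    for rank, idx in enumerate(order):
        quintile[idx] = rank * 5 // n + 1
    return quintile


if __name__ == "__main__":
    # Toy data, not taken from the study.
    demo = [
        Item("A", "A", "A", 90.0),
        Item("B", "C", "B", 35.0),
        Item("C", "C", "C", 60.0),
        Item("D", "D", "D", 75.0),
        Item("A", "B", "B", 15.0),
    ]
    print(accuracy(demo, "initial"), accuracy(demo, "verified"), consistency(demo))
    print(difficulty_quintiles(demo))
```

Grouping accuracy by the quintile labels (or by subject or question type) then reduces to calling `accuracy` on each subset of items.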

Overall accuracy for GPT-4o was 89.8% (158/176) initially and 90.9% (160/176) after verification prompting; for OpenAI o1 it was 92.6% (163/176) initially and 93.2% (164/176) after verification prompting. The consensus of medical specialists reached 94.3% (166/176), and the average MIR candidate 56.6% (100/176). Both LLMs and the consensus of medical specialists outperformed the average MIR candidate across all 20 medical subjects analyzed, with LLM accuracy ≥80% in most domains. A performance gradient was observed: LLMs’ accuracy gradually declined as question difficulty increased. Accuracy was slightly higher for clinical cases than for theoretical questions, and for positive questions than for negative ones. Both models demonstrated high response consistency, with near-perfect agreement between initial responses and those after the verification prompting.
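As a quick arithmetic check, each reported percentage can be recomputed from its count of correct answers out of 176. The sketch below uses only the counts quoted in the paragraph above; the dictionary labels and the `pct` helper are names chosen here for illustration. The average-candidate figure is left out because it presumably reflects a mean across examinees rather than a single answer sheet.

```python
TOTAL_QUESTIONS = 176

# Correct-answer counts as quoted in the results paragraph.
correct_counts = {
    "GPT-4o (initial)": 158,
    "GPT-4o (after verification)": 160,
    "OpenAI o1 (initial)": 163,
    "OpenAI o1 (after verification)": 164,
    "Specialist consensus": 166,
}


def pct(correct, total=TOTAL_QUESTIONS):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


for label, correct in correct_counts.items():
    print(f"{label}: {correct}/{TOTAL_QUESTIONS} = {pct(correct)}%")
```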

These findings highlight the excellent performance of GPT-4o and OpenAI o1 on the MIR 2024 examination, demonstrating consistent accuracy across medical subjects and question types. The integration of LLMs into medical education presents promising opportunities and is likely to reshape how students prepare for licensing examinations and change our understanding of medical education. Further research should explore how the wording, language, prompting techniques, and image-based questions can influence LLMs’ accuracy, as well as evaluate the performance of emerging artificial intelligence models in similar assessments.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12795474/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12795474/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12795474/full.md

---
Source: https://tomesphere.com/paper/PMC12795474