# Comparison of GPT-4o With Human Performance in the Polish Vascular Surgery Specialty Examination

**Authors:** Michalina Loson-Kawalec, Anna Kowalczyk, Aleksander Tabor, Patrycja Dadynska, Aleksandra Wielochowska, Dawid Boczkowski, Tomasz Dolata, Weronika Majchrowicz, Piotr Sawina, Dominika Radej, Maja Kruplewicz, Dawid Bartosik, Marta Zerek, Alicja Szalach, Gracjan Sitarek, Ada Latkowska

PMC · DOI: 10.7759/cureus.93022 · Cureus · 2025-09-23

## TL;DR

This study compares the performance of the ChatGPT-4o model with the human passing standard on the Polish specialty exam in vascular surgery, finding that the model passes but that its self-assessment is unreliable.

## Contribution

The study evaluates ChatGPT-4o's performance on a real Polish medical exam in vascular surgery, comparing its accuracy and confidence with human benchmarks.

## Key findings

- ChatGPT-4o answered 73.3% of questions correctly, surpassing the minimum passing score.
- The model's self-reported confidence did not reliably predict the accuracy of its answers.
- Performance was similar for clinical and non-clinical questions.

## Abstract

Background

Artificial intelligence (AI) offers many possibilities through language models such as ChatGPT, a generative AI chatbot developed by OpenAI (OpenAI, Inc., San Francisco, USA). Applying AI in medicine could provide a valuable tool for assessing medical expertise and holds promise for medical education. Prior investigations have documented the steadily improving performance of AI systems on medical questions. Such studies have also covered Polish medical examinations, including the State Specialization Examination (PES), which is the subject of this article. These findings have stimulated scholarly debate about whether such technologies can serve as instruments for enhancing postgraduate specialist education and training.

Objective

This study aimed to evaluate the performance of the ChatGPT-4o model in solving the PES in the field of vascular surgery. The analysis examined both the correctness of the answers and the model’s stated confidence, with the goal of understanding its potential value in education.

Methods

This study used the official PES in vascular surgery from the Spring 2025 session, comprising 120 multiple-choice items. The ChatGPT-4o model was given the examination regulations beforehand, and all items were presented in Polish. Response accuracy was evaluated against the answer key of the Medical Examination Center (CEM) in Łódź, and the model also reported its confidence on a five-point scale. Statistical analyses used the chi-square test to compare categorical variables and the Mann-Whitney U test to assess differences between non-normally distributed continuous variables.
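As a rough illustration of the chi-square comparison named in the Methods, the sketch below computes the Pearson statistic for a 2x2 table (for example, question type by correct/incorrect) using only the Python standard library. The function name `chi_square_2x2` and the toy counts are assumptions made for illustration; they are not the study's data, and this summary does not say whether the authors applied a continuity correction, so none is applied here.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 contingency table (no continuity correction).

    `table` is [[a, b], [c, d]], e.g. rows = question type, columns = correct / incorrect.
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one observation")
    # Shortcut formula for the 2x2 Pearson statistic.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy counts, made up purely for illustration (NOT taken from the paper).
toy_table = [[30, 10], [20, 20]]
chi2, p_value = chi_square_2x2(toy_table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

In practice, a statistics library would normally be used for both this test and the Mann-Whitney U test; the hand-rolled version is only meant to make the calculation visible.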

Results

ChatGPT-4o answered 88 of 120 questions correctly (73.3%), thereby surpassing the minimum passing criterion for the examination. There was no significant difference in accuracy between clinical and non-clinical questions (p=0.561). The model's self-reported confidence levels largely did not correlate with its response accuracy. Such discrepancies show that, while ChatGPT can express doubt, it cannot consistently predict its own performance, highlighting ongoing limitations in the model’s self-assessment capabilities.
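The headline score can be checked with a few lines of arithmetic. In the sketch below, the counts (88 correct out of 120) come from the Results, whereas `ASSUMED_PASS_THRESHOLD` is an assumption: this summary does not state the pass mark, and 60% is used here only as a placeholder.

```python
def accuracy(correct, total):
    """Return the fraction of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


CORRECT = 88   # reported in the Results
TOTAL = 120    # number of exam items reported in the Methods
ASSUMED_PASS_THRESHOLD = 0.60  # assumption: the pass mark is not given in this summary

score = accuracy(CORRECT, TOTAL)
passed = score >= ASSUMED_PASS_THRESHOLD
print(f"score = {score:.1%}, passed under assumed threshold: {passed}")
```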

Conclusions

ChatGPT-4o demonstrated satisfactory results on the PES vascular surgery exam, highlighting AI’s promise in specialist education, particularly as a learning aid in a highly specialized field of medicine. It is crucial to treat ChatGPT as a supporting educational tool, not as the sole source of knowledge. These findings indicate that advanced AI models may serve as valuable tools in specialist education. Nonetheless, careful oversight by medical professionals and additional validation studies across various medical fields are necessary before AI models can be widely implemented in medical education.

## Full-text entities

- **Chemicals:** GPT-4o (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12548761/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12548761/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/PMC12548761/full.md

---
Source: https://tomesphere.com/paper/PMC12548761