Comparative Performance of ChatGPT-5 and Gemini 2.5 on the Official Clinical Diabetology Specialty Examination

Anna Kowalczyk; Michalina Loson-Kawalec; Dawid Bartosik; Aleksander Tabor; Bartosz Starzynski; Patrycja Dadynska; Piotr Sawina; Marta Zerek; Gracjan Sitarek; Dawid Szymanski; Maciej Majchrzak; Mateusz Podkanowicz; Alicja Szalach; Katarzyna Romanowicz; Tomasz Dolata; Adrianna Pielech

PMC · DOI: 10.7759/cureus.101731 · January 17, 2026

Open Access

TL;DR

This study compares how well ChatGPT-5 and Gemini 2.5 perform on a clinical diabetology exam in Poland, showing both can pass but still need improvement for real-world use.

Contribution

First comparison of ChatGPT-5 and Gemini 2.5 on an official clinical diabetology exam in a non-English language setting.

Findings

01

ChatGPT-5 achieved 78.63% accuracy, while Gemini 2.5 scored 68.38% on the exam questions.

02

Both models showed high confidence in their answers, with scores of 5 and 4.957 on a five-point scale.

03

The difference in performance between the two models was statistically significant (χ² = 4.84; p = 0.0455).

Abstract

Introduction: Artificial intelligence (AI) is becoming increasingly common in many areas of life, and medicine is no exception. Many people, both inside and outside the medical community, ask, "Will it be better than doctors?" Becoming a doctor in Poland requires passing many exams, including a specialization exam. How does AI cope with the Official Clinical Diabetology Specialty Test?

Objective: The purpose of this article was to show how AI tools such as ChatGPT-5 (OpenAI, San Francisco, CA, USA) and Gemini 2.5 (Google DeepMind, London, UK) handle the Official Clinical Diabetology Specialty Test of Poland. The primary outcome was the accuracy of the answers compared with the official answer key. The secondary outcome was the models' confidence in their responses. The two models were compared statistically using McNemar's test.…
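The abstract names McNemar's test for comparing the two models on the same set of questions. As a rough illustration of how such a paired comparison can be computed, here is a minimal sketch. The discordant counts used in the usage lines are hypothetical placeholders, not data from the study, and the function names are illustrative. The text available here does not say whether the authors applied a continuity correction, so both variants are shown.

```python
import math


def mcnemar_statistic(b: int, c: int, corrected: bool = False) -> float:
    """Chi-square statistic for McNemar's test on paired binary outcomes.

    b: questions answered correctly by model A only (hypothetical input)
    c: questions answered correctly by model B only (hypothetical input)
    Questions both models got right or both got wrong do not enter the test.
    """
    n_discordant = b + c
    if n_discordant == 0:
        raise ValueError("McNemar's test is undefined with no discordant pairs")
    diff = abs(b - c)
    if corrected:
        # Edwards' continuity correction, floored at zero.
        diff = max(diff - 1, 0)
    return diff ** 2 / n_discordant


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical discordant counts, chosen only to illustrate the arithmetic.
b_only_a, c_only_b = 15, 5

stat_plain = mcnemar_statistic(b_only_a, c_only_b)
stat_corrected = mcnemar_statistic(b_only_a, c_only_b, corrected=True)
p_plain = chi2_sf_df1(stat_plain)
p_corrected = chi2_sf_df1(stat_corrected)

print(f"uncorrected: chi2 = {stat_plain:.2f}, p = {p_plain:.4f}")
print(f"corrected:   chi2 = {stat_corrected:.2f}, p = {p_corrected:.4f}")
```

Only the discordant pairs drive the result, which is why the test suits two models answering an identical question set: agreement on a question carries no information about which model is better.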

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes (2)

GLP1R, INS

Proteins (2)

Species (1)

Homo sapiens (human)

Chemicals (32)

Diseases (35)

Mutations (2)

inserted at position 21; inserted at position 7

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Mobile Health and mHealth Applications · Clinical Reasoning and Diagnostic Skills