Comparative Performance of ChatGPT-5 and Gemini 2.5 on the Official Clinical Diabetology Specialty Examination
Anna Kowalczyk, Michalina Loson-Kawalec, Dawid Bartosik, Aleksander Tabor, Bartosz Starzynski, Patrycja Dadynska, Piotr Sawina, Marta Zerek, Gracjan Sitarek, Dawid Szymanski, Maciej Majchrzak, Mateusz Podkanowicz, Alicja Szalach, Katarzyna Romanowicz, Tomasz Dolata

TL;DR
This study compares how well ChatGPT-5 and Gemini 2.5 perform on a clinical diabetology exam in Poland, showing both can pass but still need improvement for real-world use.
Contribution
First comparison of ChatGPT-5 and Gemini 2.5 on an official clinical diabetology exam in a non-English language setting.
Findings
ChatGPT-5 achieved 78.63% accuracy, while Gemini 2.5 scored 68.38% on the exam questions.
Both models showed high confidence in their answers, with scores of 5 and 4.957 on a five-point scale.
The difference in performance between the two models was statistically significant (χ² = 4.84; p = 0.0455).
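The abstract names McNemar's test as the method used to compare the two models. As a minimal sketch of how such a paired comparison can be computed, the function below takes only the discordant counts: `b` (questions only the first model answered correctly) and `c` (questions only the second model answered correctly). The function name, its parameters, and the example counts are hypothetical and do not come from the paper. The source does not say which variant produced the reported χ² and p-value, so the sketch returns the χ² statistic, its asymptotic p-value, and the exact binomial p-value rather than trying to reproduce them.

```python
import math


def mcnemar(b: int, c: int, continuity: bool = True):
    """McNemar's test on paired right/wrong outcomes (illustrative sketch).

    b: items answered correctly only by model A (hypothetical input)
    c: items answered correctly only by model B (hypothetical input)
    Returns (chi2, asymptotic p-value with 1 df, exact two-sided p-value).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the models never disagree on correctness.
        return 0.0, 1.0, 1.0

    # Chi-square statistic, optionally with the continuity correction.
    diff = abs(b - c) - (1 if continuity else 0)
    diff = max(diff, 0)
    chi2 = diff * diff / n

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_chi2 = math.erfc(math.sqrt(chi2 / 2))

    # Exact two-sided p-value from Binomial(n, 0.5) on the discordant pairs.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)

    return chi2, p_chi2, p_exact


if __name__ == "__main__":
    # Hypothetical discordant counts, NOT taken from the paper.
    chi2, p_chi2, p_exact = mcnemar(b=20, c=8)
    print(f"chi2={chi2:.3f}  p(asymptotic)={p_chi2:.4f}  p(exact)={p_exact:.4f}")
```

Only the discordant pairs enter the test, which is why accuracy figures alone are not enough to recompute it: the per-question agreement between the two models is also needed.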
Abstract
Introduction: Artificial intelligence (AI) is becoming increasingly common in many areas of life, and medicine is no exception. Many people, both within and outside the medical community, ask: "Will it be better than doctors?" Becoming a doctor in Poland requires passing many exams, including a specialization exam. How does AI cope with the Official Clinical Diabetology Specialty Test?

Objective: This article aimed to show how AI tools such as ChatGPT-5 (OpenAI, San Francisco, CA, USA) and Gemini 2.5 (Google DeepMind, London, UK) handle Poland's Official Clinical Diabetology Specialty Test. The primary outcome was the accuracy of the answers compared with the official answer key. The secondary outcome was the models' confidence in their responses. The two models were compared statistically using McNemar's test.…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Mobile Health and mHealth Applications · Clinical Reasoning and Diagnostic Skills
