# Comparative Performance of ChatGPT-5 and Gemini 2.5 on the Official Clinical Diabetology Specialty Examination

**Authors:** Anna Kowalczyk, Michalina Loson-Kawalec, Dawid Bartosik, Aleksander Tabor, Bartosz Starzynski, Patrycja Dadynska, Piotr Sawina, Marta Zerek, Gracjan Sitarek, Dawid Szymanski, Maciej Majchrzak, Mateusz Podkanowicz, Alicja Szalach, Katarzyna Romanowicz, Tomasz Dolata, Adrianna Pielech

PMC · DOI: 10.7759/cureus.101731 · 2026-01-17

## TL;DR

This study compares ChatGPT-5 and Gemini 2.5 on Poland's official clinical diabetology specialty examination. Both models reached passing scores, but the authors conclude that neither can yet stand in for specialists.

## Contribution

First comparison of ChatGPT-5 and Gemini 2.5 on an official clinical diabetology exam in a non-English language setting.

## Key findings

- ChatGPT-5 achieved 78.63% accuracy, while Gemini 2.5 scored 68.38% on the exam questions.
- Both models reported high confidence in their answers: 5 for ChatGPT-5 and 4.957 for Gemini 2.5 on a five-point scale.
- The difference in performance between the two models was statistically significant on McNemar's test (χ² = 4.84; p = 0.0455).

## Abstract

Introduction

Artificial intelligence (AI) is becoming increasingly common in many areas of life, and medicine is no exception. Many people, both inside and outside medicine, ask, "Will it be better than doctors?" Qualifying as a doctor in Poland requires passing many examinations, including a specialty examination. How does AI cope with the Official Clinical Diabetology Specialty Test?

Objective

The aim of this article was to assess how AI tools, such as ChatGPT-5 (OpenAI, San Francisco, CA, USA) and Gemini 2.5 (Google DeepMind, London, UK), handle the Official Clinical Diabetology Specialty Test of Poland. The primary outcome was the accuracy of the answers compared with the official answer key. The secondary outcome was the models' confidence in their responses. The two models were compared statistically using McNemar's test.

Methodology

The study analyzed 117 questions randomly selected from the archive of the Centre for Medical Examination (CEM) in Łódź, Poland. The questions were multiple-choice with a single correct answer. ChatGPT-5 and Gemini 2.5 answered the questions in Polish. Responses were assessed against the official answer key published by the CEM. Statistical analysis was performed using McNemar's test.
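As an illustration of the paired comparison described above, the following Python sketch computes McNemar's chi-square statistic from the two discordant counts. The function name `mcnemar_chi2`, the optional continuity correction, and the example counts are assumptions for illustration only; the abstract does not report the discordant-pair counts or state whether a correction was applied.

```python
import math


def mcnemar_chi2(b: int, c: int, continuity: bool = False) -> tuple[float, float]:
    """Return (chi2, p) for McNemar's test on paired binary outcomes.

    b: questions model A answered correctly and model B answered incorrectly
    c: questions model B answered correctly and model A answered incorrectly
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c)
    if continuity:
        # Edwards' continuity correction (an assumption; not stated in the source)
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Hypothetical discordant counts, NOT taken from the study
chi2, p = mcnemar_chi2(b=15, c=5)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Only questions on which the two models disagree enter the statistic, which is why McNemar's test suits a design where both models answer the same question set.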

Results

ChatGPT-5 answered 92 of 117 questions correctly (78.63%). Gemini 2.5 answered 80 of 117 questions correctly, an accuracy of 68.38%. The difference between the two models was statistically significant (χ² = 4.84; p = 0.0455). Confidence in the answers was similar for both models (ChatGPT-5 = 5; Gemini 2.5 = 4.957 on a five-point scale).
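The accuracy figures reported above follow directly from the counts of correct answers. A minimal Python sketch of that arithmetic is given here; the names `TOTAL_QUESTIONS`, `correct`, and `accuracy_percent` are illustrative and not from the study.

```python
TOTAL_QUESTIONS = 117  # number of exam questions analyzed

# Correct-answer counts as reported in the abstract
correct = {"ChatGPT-5": 92, "Gemini 2.5": 80}


def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100 * n_correct / n_total, 2)


for model, n in correct.items():
    print(f"{model}: {accuracy_percent(n, TOTAL_QUESTIONS)}%")
# ChatGPT-5: 78.63%
# Gemini 2.5: 68.38%
```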

Conclusions

Both AI models, ChatGPT-5 and Gemini 2.5, achieved scores sufficient to pass the Official Clinical Diabetology Specialty Examination. The results are promising: AI can be used in the study process, but it still needs to be better adapted to managing clinical cases. It can serve as a learning tool, but for now it cannot replace specialists.

## Full-text entities

- **Genes:** GLP1R (glucagon like peptide 1 receptor) [NCBI Gene 2740] {aka GLP-1, GLP-1-R, GLP-1R}, INS (insulin) [NCBI Gene 3630] {aka IDDM, IDDM1, IDDM2, ILPR, IRDN, MODY10}
- **Diseases:** overweight (MESH:D050177), LLMs (MESH:D007806), AI (MESH:C538142), nausea (MESH:D009325), Obesity (MESH:D009765), systemic lupus erythematosus (MESH:D008180), Vomiting (MESH:D014839), hypotensive drugs (MESH:D007022), glucosuria (MESH:D006030), psoriasis (MESH:D011565), proteinuria (MESH:D011507), iron deficiency anemia (MESH:D018798), pain (MESH:D010146), dyslipidemia (MESH:D050171), alcohol abuse (MESH:D000437), hyperglycemia (MESH:D006943), abdominal pain (MESH:D015746), chronic kidney disease (MESH:D051436), swelling (MESH:D004487), Diabetes (MESH:D003920), type 1 diabetes (MESH:D003922), anorexia nervosa (MESH:D000856), Type 2 diabetes (MESH:D003924), heart failure (MESH:D006333), bloating (MESH:C535647), ketoacidosis (MESH:D007662), hypertension (MESH:D006973), hypothyroidism (MESH:D007037), diabetic foot syndrome (MESH:D017719), hallucinations (MESH:D006212), chronic pancreatitis (MESH:D050500), Hypoglycemia (MESH:D007003), Diabetic neuropathy (MESH:D003929), end-stage renal disease (MESH:D007676), Endocrine system diseases (MESH:D004700)
- **Chemicals:** eplerenone (MESH:D000077545), Sulfonylurea (MESH:D013453), Empagliflozin (MESH:C570240), Acetylsalicylic acid (MESH:D001241), Metformin (MESH:D008687), cholesterol (MESH:D002784), thiazide (MESH:D049971), Blood glucose (MESH:D001786), ascorbic acid (MESH:D001205), lispro (MESH:D061268), triglycerides (MESH:D014280), torsemide (MESH:D000077786), Duloxetine (MESH:D000068736), fibrate (MESH:D058607), Pioglitazone (MESH:D000077205), mannitol (MESH:D008353), Glucose (MESH:D005947), creatinine (MESH:D003404), hydroxyurea (MESH:D006918), alcohol (MESH:D000438), Atorvastatin (MESH:D000069059), TG (MESH:D013866), K (MESH:D011188), Degludec (MESH:C571886), perindopril (MESH:D020913), tetracycline (MESH:D013752), Glargine (MESH:D000069036), FT4 (-), bisoprolol (MESH:D017298), Glimepiride (MESH:C057619), Dapagliflozin (MESH:C529054), hydrochlorothiazide (MESH:D006852)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** inserted at position 21, inserted at position 7

---
Source: https://tomesphere.com/paper/PMC12909287