# Comparative performance of ChatGPT, Gemini, and Deepseek on endodontic exam questions in Turkish and English

**Authors:** Eda Gürsu Şahin

PMC · DOI: 10.1186/s12903-026-07753-5 · BMC Oral Health · 2026-02-04

## TL;DR

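This study compares how well ChatGPT-4, Gemini 2.0, and DeepSeek-R1 answer 130 multiple-choice endodontics questions from the dentistry specialty examination in Turkish and English. All three models answered correctly more often in English, and in Turkish DeepSeek-R1 and Gemini 2.0 significantly outperformed ChatGPT-4.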

## Contribution

The study benchmarks three LLMs (ChatGPT-4, Gemini 2.0, and DeepSeek-R1) on the same endodontic exam questions in two languages, and reports how performance differs by language, by model, and by question type (Simple-style versus Combination-style).

## Key findings

- All models performed better in English than in Turkish for correct answers and explanations.
- DeepSeek-R1 and Gemini 2.0 outperformed ChatGPT-4 in Turkish for correct answers and explanations.
- Models performed significantly better on Simple-style questions compared to Combination-style questions in both languages.

## Abstract

Large language model-based artificial intelligence (LLM-based AI) applications have become a focal point in the healthcare field. This study aimed to compare the performance of ChatGPT-4, Gemini 2.0 and DeepSeek-R1 in answering endodontics questions from the dentistry specialty examination in both Turkish and English.

A total of 130 multiple-choice Endodontics questions from the dentistry specialty examination question pool were presented to LLMs developed by OpenAI (ChatGPT-4), Google (Gemini 2.0) and DeepSeek (DeepSeek-R1). The questions were entered into each model under standardized conditions in both English and Turkish. The responses and their explanations were classified based on predefined criteria as “correct answer and explanation”, “correct answer with incorrect explanation” and “incorrect”.

The R programming language was used within the RStudio environment for statistical analysis. McNemar’s Chi-squared test with continuity correction was applied to analyze the models’ performance in providing correct answers and explanations across different languages, as well as to compare performance between models. Fisher’s Exact Test was used to analyze the models’ responses to different question types. The threshold for statistical significance was set at p < 0.05.
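The paired comparison described above can be sketched as follows. The study ran its analysis in R; this is a standard-library Python sketch of McNemar's chi-squared test with continuity correction, not the author's code. The function names (`discordant_counts`, `mcnemar_cc`) and all example counts are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import math


def discordant_counts(first: list[bool], second: list[bool]) -> tuple[int, int]:
    """Count discordant pairs for two paired outcome lists.

    Each position is one question; True means the response was scored
    as correct. Returns (b, c), where b counts questions correct only
    in `first` and c counts questions correct only in `second`.
    """
    b = sum(1 for x, y in zip(first, second) if x and not y)
    c = sum(1 for x, y in zip(first, second) if y and not x)
    return b, c


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-squared test with continuity correction.

    statistic = (|b - c| - 1)^2 / (b + c), compared against a
    chi-squared distribution with 1 degree of freedom.
    """
    n = b + c
    if n == 0:
        # Assumption: with no discordant pairs we report "no evidence
        # of a difference" instead of an undefined statistic.
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10 questions correct only in English,
    # 2 questions correct only in Turkish.
    stat, p_value = mcnemar_cc(10, 2)
    print(f"statistic={stat:.3f}, p={p_value:.4f}")
```

Only the discordant pairs enter the statistic, which is why the test suits the same 130 questions being answered twice (once per language, or once per model). With the hypothetical counts above the statistic is 49/12, about 4.08, and the p-value falls just under the 0.05 threshold.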

When analyzed individually, DeepSeek-R1, Gemini 2.0 and ChatGPT-4 provided correct answers at a higher rate in English compared to Turkish. In Turkish, the performance of DeepSeek-R1 and Gemini 2.0 in providing correct answers and accurate explanations was significantly higher than that of ChatGPT-4. All models demonstrated significantly better performance on Simple-style questions compared to Combination-style questions in both languages.
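The question-type comparison relies on Fisher's Exact Test. The sketch below is a standard-library Python version of the two-sided test for a 2x2 table, not the study's R code. It assumes the outcome is collapsed to correct versus incorrect for Simple-style versus Combination-style questions, which the abstract does not confirm. The function name `fisher_exact_2x2` and the example tables are hypothetical.

```python
import math


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be question types and columns correct / incorrect.
    The p-value sums the hypergeometric probabilities of every table
    with the same margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = math.comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    cutoff = p_observed * (1 + 1e-7)
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= cutoff)
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical table: 3 of 4 Simple-style questions correct,
    # 1 of 4 Combination-style questions correct.
    print(f"p={fisher_exact_2x2(3, 1, 1, 3):.4f}")
```

An exact test is a natural choice when some cells are small, as can happen when a model answers nearly every question of one type correctly. For the hypothetical table above the p-value is 34/70, about 0.486.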

These findings indicate that LLMs show promise on standardized tests in dentistry. However, although they can recognize patterns and organize data, they remain limited in their grasp of the underlying concepts. The results also point to the need for continued improvement across subjects and languages, and to the risk of hallucinations in model responses.

The online version contains supplementary material available at 10.1186/s12903-026-07753-5.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12949502/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12949502/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC12949502/full.md

---
Source: https://tomesphere.com/paper/PMC12949502