# A Comparative Assessment of ChatGPT, Gemini, and DeepSeek Accuracy: Examining Visual Medical Assessment in Internal Medicine Cases with and Without Clinical Context

**Authors:** Rayah Asiri, Azfar Athar Ishaqui, Salman Ashfaq Ahmad, Muhammad Imran, Khalid Orayj, Adnan Iqbal

PMC · DOI: 10.3390/diagnostics16030388 · Diagnostics · 2026-01-26

## TL;DR

Adding brief clinical context significantly improves the diagnostic accuracy of general-purpose large language models on visual internal medicine cases.

## Contribution

This study quantifies how clinical context affects the diagnostic accuracy of ChatGPT, Gemini, and DeepSeek on visual medical cases.

## Key findings

- Including clinical history improved all models' accuracy, with ChatGPT rising from 50.7% to 80.4%.
- Gemini and DeepSeek also gained significantly with clinical context: Gemini rose from 39.9% to 72.5% and DeepSeek from 30.4% to 75.4%.
- Agreement with the reference differential diagnoses remained modest, with average overlap of 6.99% for ChatGPT, 36.39% for Gemini, and 32.74% for DeepSeek.
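
To put uncertainty around the reported accuracies, the sketch below computes Wilson 95% score intervals, the interval type the abstract names. The correct-answer counts are not stated in this summary; they are inferred here from the reported percentages and the 138-case total, so treat them as assumptions. The function name `wilson_interval` and the `INFERRED_CORRECT` table are illustrative, not taken from the paper.

```python
import math

N_CASES = 138  # number of cases stated in the abstract

# ASSUMPTION: correct-answer counts inferred from the reported percentages
# (e.g. 70/138 rounds to 50.7%). The summary does not list raw counts.
INFERRED_CORRECT = {
    "ChatGPT": {"phase1": 70, "phase2": 111},
    "Gemini": {"phase1": 55, "phase2": 100},
    "DeepSeek": {"phase1": 42, "phase2": 104},
}


def wilson_interval(successes, n, z=1.959963984540054):
    """Return the (lower, upper) Wilson score interval for a proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


for model, phases in INFERRED_CORRECT.items():
    for phase, correct in phases.items():
        low, high = wilson_interval(correct, N_CASES)
        print(f"{model:8s} {phase}: {correct / N_CASES:.1%} "
              f"(95% CI {low:.1%} to {high:.1%})")
```

The Wilson interval is generally preferred over the simple normal approximation for samples of this size because it stays inside [0, 1] and behaves better when the proportion is far from 50%.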

## Abstract

Background and Aim: Large language models (LLMs) demonstrate significant potential in assisting with medical image interpretation. However, the diagnostic accuracy of general-purpose LLMs on image-based internal medicine cases and the added value of brief clinical history remain unclear. This study evaluated three general-purpose LLMs (ChatGPT, Gemini, and DeepSeek) on expert-curated cases to quantify diagnostic accuracy with image-only input versus image plus brief clinical context.

Methods: We conducted a comparative evaluation using 138 expert-curated cases from Harrison’s Visual Case Challenge. Each case was presented to the models in two distinct phases: Phase 1 (image only) and Phase 2 (image plus a brief clinical history). The primary endpoint was top-1 diagnostic accuracy for the textbook diagnosis, comparing performance with versus without a brief clinical history. Secondary/Exploratory analyses compared models and assessed agreement between model-generated differential lists and the textbook differential. Statistical analysis included Wilson 95% confidence intervals, McNemar’s tests, Cochran’s Q with Benjamini–Hochberg correction, and Wilcoxon signed-rank tests.

Results: The inclusion of clinical history substantially improved diagnostic accuracy for all models. ChatGPT’s accuracy increased from 50.7% in Phase 1 to 80.4% in Phase 2. Gemini’s accuracy improved from 39.9% to 72.5%, and DeepSeek’s accuracy rose from 30.4% to 75.4%. In Phase 2, diagnostic accuracy reached at least 65% across most disease nature and organ system categories. However, agreement with the reference differential diagnoses remained modest, with average overlap rates of 6.99% for ChatGPT, 36.39% for Gemini, and 32.74% for DeepSeek.

Conclusions: The provision of brief clinical history significantly enhances the diagnostic accuracy of large language models on visual internal medicine cases. In this benchmark, performance differences between models were smaller in Phase 2 than in Phase 1. While diagnostic precision improves markedly, the models’ ability to generate comprehensive differential diagnoses that align with expert consensus is still limited. These findings underscore the utility of context-aware, multimodal LLMs for educational support and structured diagnostic practice in supervised settings while also highlighting the need for more sophisticated, semantics-sensitive benchmarks for evaluating diagnostic reasoning.
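
The Phase 1 versus Phase 2 comparison is paired: each case is answered twice by the same model, so only the cases whose outcome changes carry information. That is the setting McNemar's test addresses. A minimal sketch of the exact two-sided form follows. The discordant counts are hypothetical, since this summary does not report them; they are chosen only so that the net gain is 41 of 138 cases, roughly matching ChatGPT's reported 29.7-point rise. The function name is illustrative, and the summary does not say whether the authors used the exact or the chi-square version.

```python
import math


def mcnemar_exact(wrong_to_right, right_to_wrong):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant case is equally likely to
    fall in either direction, so the smaller count follows Binomial(n, 0.5).
    """
    if wrong_to_right < 0 or right_to_wrong < 0:
        raise ValueError("counts must be non-negative")
    n = wrong_to_right + right_to_wrong
    if n == 0:
        return 1.0  # no discordant cases, no evidence of a change
    k = min(wrong_to_right, right_to_wrong)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# HYPOTHETICAL discordant counts for one model (not reported in the summary):
# 45 cases wrong in Phase 1 but right in Phase 2, 4 cases the other way.
p_value = mcnemar_exact(wrong_to_right=45, right_to_wrong=4)
print(f"exact McNemar p-value: {p_value:.3g}")
```

Cases answered correctly in both phases, or incorrectly in both, do not enter the calculation, which is why accuracy percentages alone are not enough to reproduce the test.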

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12897363/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12897363/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC12897363/full.md

---
Source: https://tomesphere.com/paper/PMC12897363