# Assessing the Clinical Utility of Multimodal Large Language Models in the Diagnosis and Management of Pigmented Choroidal Lesions

**Authors:** Nehal Nailesh Mehta, Evan Walker, Elena Flester, Gillian Folk, Akshay Agnihotri, Ines D. Nagel, Melanie Tran, Michael H. Goldbaum, Shyamanga Borooah, Nathan L. Scott

PMC · DOI: 10.1167/tvst.14.10.13 · Translational Vision Science & Technology · 2025-10-14

## TL;DR

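This study compares how well multimodal large language models and expert human graders classify pigmented choroidal lesions as choroidal nevus or melanoma and recommend treatment. The human experts performed better, although the models, Gemini in particular, showed some promise.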

## Contribution

For the first time, the study evaluates multimodal large language models for diagnosing choroidal lesions and compares them with expert human graders.

## Key findings

- Gemini outperformed ChatGPT and Perplexity on both diagnostic and treatment-recommendation prompts.
- Both human graders outperformed all AI models in accuracy and sensitivity on most prompts (P < 0.005).
- Accuracy did not improve when the models were given demographic or clinical data, except for Gemini.

## Abstract

To evaluate the diagnostic and treatment recommendation performance of multimodal large language models (MLLMs) in identifying and classifying retinal lesions as choroidal nevus or melanoma, and to compare their performance with that of expert human graders.

This retrospective cross-sectional study included 48 eyes from 47 patients diagnosed with either choroidal nevus or melanoma. Patient demographics, including age, sex, ethnicity, best-corrected visual acuity (BCVA), and symptoms, were documented. Color fundus, autofluorescence, optical coherence tomography, and B-scan images were collected. The ocular images and patient characteristics were presented to ChatGPT 4.0, Gemini Advanced 1.5 Pro, and Perplexity Pro. Responses were recorded and compared with the clinical diagnoses and treatment recommendations made by two expert human graders. Diagnostic and treatment agreement, accuracy, sensitivity, and specificity were analyzed.

Gemini consistently outperformed ChatGPT and Perplexity across diagnostic and treatment prompts. The highest model performance was observed for prompts requesting treatment recommendations with clinical information, where Gemini achieved the highest accuracy (0.725), followed by Perplexity (0.647) and ChatGPT (0.314). Performance was lowest for prompts requiring strict clinical criteria, with all models showing poor sensitivity. Both human graders outperformed all MLLMs in accuracy and sensitivity on most prompts (P < 0.005). Accuracy did not improve when the models were provided with demographic or clinical data, except for Gemini.

Human graders outperform current MLLMs, which show only moderate ability to diagnose choroidal nevi or melanoma from imaging.

This study highlights limitations and potential of MLLMs in aiding diagnosis and treatment of choroidal lesions.
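
The accuracy, sensitivity, and specificity named in the abstract can be illustrated with a minimal Python sketch. It assumes melanoma is treated as the positive class and that each response is reduced to a single label; both are assumptions for illustration, not details confirmed by this summary. The function name `binary_metrics` and the five example labels are hypothetical and are not data from the study.

```python
from typing import Dict, Sequence


def binary_metrics(
    truth: Sequence[str],
    predicted: Sequence[str],
    positive: str = "melanoma",  # assumption: melanoma is the positive class
) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for paired labels."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("truth and predicted must be non-empty and equal in length")

    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of true melanomas that were labeled melanoma.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of true nevi that were labeled nevus.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels for five lesions (illustrative only, not study data).
truth = ["melanoma", "melanoma", "nevus", "nevus", "nevus"]
predicted = ["melanoma", "nevus", "nevus", "nevus", "melanoma"]
metrics = binary_metrics(truth, predicted)
print(metrics)
```

In this toy example one of two melanomas is missed, so sensitivity is 0.5 even though accuracy is 0.6. That gap shows why a model can look acceptable on accuracy while performing poorly on sensitivity, the pattern the abstract describes for the strict-criteria prompts.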

## Linked entities

- **Diseases:** melanoma (MONDO:0005105)

## Full-text entities

- **Diseases:** choroidal nevi (MESH:D009506), melanoma (MESH:D008545), retinal lesions (MESH:D012164), Pigmented Choroidal Lesions (MESH:D015862), choroidal nevus (MESH:D002833)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12530446/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12530446/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12530446/full.md

---
Source: https://tomesphere.com/paper/PMC12530446