# Large Language Model‐Driven Analysis and Report Generation of Endoscopy Videos—A Pilot Study

**Authors:** Davide Massimi, Luca Di Stefano, Tommy Rizkala, Marco Spadaccini, Yuichi Mori, Maddalena Menini, Giulio Antonelli, Kareem Khalaf, Raf Bisschops, Daniel von Renteln, Prateek Sharma, Douglas K. Rex, Michael Bretthauer, Carlo Castoro, Roberto De Sire, Ludovico Alfarone, Alessandro D’Aprano, Silvia Carrara, Roberta Maselli, Vincenzo Vadalà, Francesco Menini, Abdelrahman Ashraf Alawdy Elsaman, Alessandro Fugazza, Matteo Colombo, Renato de Martino, Antonio Capogreco, Gianluca Franchelucci, Victor Savevski, Elena De Momi, Luca Carlini, Chiara Lena, Sravanthi Parasa, Susanne O’Reilly, Simone Dibitetto, Matteo Spertino, Alessandro Repici, Cesare Hassan

PMC · DOI: 10.1111/den.70134 · Digestive Endoscopy · 2026-03-10

## TL;DR

This pilot study tested whether a multimodal large language model (Gemini 2.5 Pro) can generate clinically adequate reports from full esophagogastroduodenoscopy (EGD) videos, and found its performance insufficient for clinical use.

## Contribution

The study is the first to evaluate the impact of CAD overlays on MLLM performance in EGD reporting.

## Key findings

- Gemini 2.5 Pro showed inadequate performance for clinical EGD reporting.
- CAD overlays reduced the model's overall landmark accuracy (0.55 with clean video vs. 0.33 with overlay; p = 0.029).
- The study highlights the need for optimization and larger-scale validation.
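
The clean-versus-overlay comparison is paired: each video is analyzed once without and once with the CAD overlay. This summary does not name the statistical test used. As a minimal sketch, assuming an exact McNemar test (a common choice for paired binary outcomes), the p-value depends only on the discordant pairs. The function name `mcnemar_exact` and the example counts are hypothetical, not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs adequate/correct with the clean video only
    c: pairs adequate/correct with the overlay video only
    (Assumed test; the summary does not state which test was used.)
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two conditions are indistinguishable.
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability at or below the smaller count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(2, 0))  # 0.5
print(mcnemar_exact(4, 0))  # 0.125
print(mcnemar_exact(0, 0))  # 1.0
```

With so few discordant pairs, such a test has little power, which fits the hypothesis-generating framing of a five-video pilot.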

## Abstract

Multimodal large language models (MLLMs) can automatically analyze clinical video, but evidence from full esophagogastroduodenoscopy (EGD) and the impact of on‐screen computer‐aided detection/diagnosis (CAD) overlays on MLLM behavior remain unclear. We tested whether an MLLM can produce clinically adequate EGD reports and whether a CAD overlay changes performance. We analyzed five complete EGD videos with Gemini 2.5 Pro in paired versions: (1) clean video and (2) the same video with a CAD overlay. Five blinded endoscopists rated report adequacy in three domains. MLLM accuracy for landmarks/lesions was further assessed by two blinded expert endoscopists using a time‐window rule (a model detection counted as correct if it occurred within ±2 s of the expert‐annotated timestamp). In this retrospective pilot study, five archived diagnostic EGD procedures from five patients were available as full‐length videos. Across five raters, MLLM Completeness was judged adequate in 56.0% (14/25 ratings) with Clean‐Video versus 48.0% (12/25 ratings) with Overlay‐Video (p = 0.500). Visualization was identical (36.0% [9/25 ratings] for both; p = 1.000). Lesion characteristics were also identical (16.0% [4/25 ratings] for both; p = 1.00). For landmark agreement, overall MLLM accuracy with Clean‐Video versus Overlay‐Video was 0.55 [95% CI 0.43–0.67] vs. 0.33 [0.23–0.46], p = 0.029; sensitivity was 0.53 [0.40–0.66] vs. 0.35 [0.24–0.49], p = 0.122; and specificity was 0.67 [0.35–0.88] vs. 0.22 [0.06–0.55], p = 0.125. In this pilot study, Gemini 2.5 Pro demonstrated inadequate performance for clinical EGD reporting. These hypothesis‐generating findings suggest substantial optimization and larger‐scale validation are required before deployment.
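
The ±2 s time-window rule and the bracketed 95% intervals can be illustrated with a short sketch. Everything here is an assumption for illustration: the function names, the one-to-one greedy matching, the example timestamps, and the choice of a Wilson score interval, since the abstract does not state the matching procedure or the interval method. The counts 6/9 and 2/9 are likewise hypothetical; they are used only because they happen to reproduce the reported specificity intervals.

```python
from math import sqrt


def match_detections(expert_times, model_times, window_s=2.0):
    """Apply a +/- window_s time-window rule.

    Each expert-annotated timestamp is matched to the nearest unused model
    detection within the window (one-to-one matching is an assumption).
    Returns (hits, misses, unmatched_model_detections).
    """
    unused = sorted(model_times)
    hits = 0
    for t in sorted(expert_times):
        candidates = [m for m in unused if abs(m - t) <= window_s]
        if candidates:
            best = min(candidates, key=lambda m: abs(m - t))
            unused.remove(best)  # a detection can confirm only one event
            hits += 1
    return hits, len(expert_times) - hits, len(unused)


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical timestamps in seconds, for illustration only.
expert = [10.0, 30.0, 50.0]
model = [11.5, 33.0, 49.0, 70.0]
print(match_detections(expert, model))  # (2, 1, 2)

# Hypothetical counts that reproduce the reported specificity intervals.
print([round(x, 2) for x in wilson_interval(6, 9)])  # [0.35, 0.88]
print([round(x, 2) for x in wilson_interval(2, 9)])  # [0.06, 0.55]
```

The width of these intervals reflects the small number of events in five videos: with a denominator near nine, even a difference of 0.67 versus 0.22 leaves overlapping ranges.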

## Full-text entities

- **Diseases:** gastritis (MESH:D005756), dyspepsia (MESH:D004415), polyp (MESH:D011127), lesion (MESH:D009059), mucosal abnormalities (MESH:D052016), ulcers (MESH:D014456), erosions (MESH:D014077), infection (MESH:D007239), epigastric pain (MESH:D010146), Barrett (MESH:D001471), anemia (MESH:D000740)
- **Chemicals:** propofol (MESH:D015742)
- **Species:** Homo sapiens (human, species) [taxon 9606], Helicobacter pylori (species) [taxon 210]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12972633/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12972633/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12972633/full.md

---
Source: https://tomesphere.com/paper/PMC12972633