# Poster Session I - A55 PERFORMANCE OF LARGE LANGUAGE MODELS IN THE OPTICAL DIAGNOSIS OF COLORECTAL POLYPS

**Authors:** J C Vences, W T Tran, N Gimpaya, C M Walsh, R Khan, R Bechara, D von Renteln, S Grover

PMC · DOI: 10.1093/jcag/gwaf042.055 · 2026-02-13

## TL;DR

This study evaluates how accurately multimodal large language models classify colorectal polyps and predict their histology from endoscopic images, comparing their performance against expert responses and ESGE standards.

## Contribution

The paper introduces the first evaluation of large language models in applying NICE and JNET classification systems for colorectal polyp diagnosis.

## Key findings

- Claude Opus 4 and GPT-5 had the highest percent correct scores (41.7%) when classifying polyps with the Paris system.
- The NICE classification yielded the highest percent correct scores of any classification system, ranging from 78.0% to 84.1% across models.
- Sensitivity and specificity of MLLMs did not meet ESGE standards, indicating a need for further refinement before clinical use.

## Abstract

Optical diagnosis allows for rapid endoscopic decision-making, but many practitioners are inadequately trained. Large language models, such as Anthropic’s Claude Opus 4, have not been evaluated for their ability to apply the NICE or JNET classification systems to colorectal polyps and thereby predict lesion histology more reliably. Additionally, there is little published guidance on effective prompting strategies in gastrointestinal endoscopy.

We hypothesized that the diagnostic accuracy of multimodal large language models (MLLMs) in classifying colorectal polyps and predicting histology would be comparable to the performance thresholds in society (ASGE and ESGE) guidelines and to the accuracy of traditional computer-aided diagnostic systems.

We conducted a retrospective diagnostic accuracy study using the PRIME dataset, a curated set of white-light and narrow-band imaging (NBI) images. We evaluated Claude Opus 4, Google Gemini 2.5 Pro, GPT-o3, GPT-4o, and GPT-5. For the Paris and Narrow-band Imaging International Colorectal Endoscopic (NICE) classifications and for predicted histology, we calculated the percent correct score and accuracy of each MLLM against expert responses for 132 cases. For the Japanese NBI Expert Team (JNET) classification, we analyzed 82 cases. We calculated accuracy, sensitivity, specificity, and positive and negative predictive values. Cochran’s Q and McNemar’s tests were used to determine differences between the predictions of each MLLM.
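As an illustration of the metrics named above, the following is a minimal sketch of how accuracy, sensitivity, specificity, and predictive values can be derived from paired model and expert labels for the neoplastic vs. non-neoplastic question. The function name `diagnostic_metrics`, the `DiagnosticMetrics` class, and the toy labels are hypothetical and not taken from the study; the abstract does not describe the authors' actual analysis code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DiagnosticMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined (zero denominator).
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> DiagnosticMetrics:
    """Both inputs use True = neoplastic, False = non-neoplastic."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false positives
    fn = sum(1 for p, r in pairs if not p and r)      # false negatives
    return DiagnosticMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Hypothetical toy data (10 cases), NOT taken from the study.
reference = [True] * 6 + [False] * 4
predicted = [True, True, True, True, True, False, False, False, False, True]
metrics = diagnostic_metrics(predicted, reference)
print(metrics)
```

In this toy example the model is right on 8 of 10 cases, so accuracy is 0.8, while sensitivity (5/6) and specificity (3/4) differ. This shows how a model can clear an accuracy threshold while still falling short on a separate sensitivity or specificity threshold.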

Under the Paris classification, Claude Opus 4 and GPT-5 had significantly higher percent correct scores than the other MLLMs, at 41.7%. The NICE classification yielded the highest percent correct scores of any classification, ranging from 78.0% to 84.1% across models. For distinguishing neoplastic from non-neoplastic polyps, accuracy exceeded 80% for every model.

Claude Opus 4 and Gemini 2.5 Pro showed the highest accuracy in differentiating colorectal polyp subtypes, performing closest to expert consensus. Sensitivity and specificity, however, did not meet ESGE standards, highlighting the need for prospective multicenter trials and the design of human-in-the-loop workflows before clinical deployment.

Pairwise comparisons showed statistically significant differences (p < 0.05) on McNemar’s test: (a) with Gemini 2.5 Pro, GPT-o3, and GPT-4o; (b) with Claude Opus 4 and Gemini 2.5 Pro; (c) with Claude Opus 4, GPT-4o, and GPT-5; (d) with Claude Opus 4, GPT-o3, and GPT-5; (e) with Gemini 2.5 Pro, GPT-o3, and GPT-4o.
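For readers unfamiliar with McNemar's test, the sketch below shows one way a single pairwise comparison between two models could be computed from per-case correctness. It uses the exact (binomial) two-sided form. This is an assumption, since the abstract does not state which variant was used or whether p-values were adjusted for multiple comparisons. The function name `mcnemar_exact` and the toy data are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-case correctness.

    Only discordant cases (one model right, the other wrong) carry information.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same cases")
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant cases split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data (37 cases), NOT taken from the study:
# 20 both correct, 10 only A correct, 2 only B correct, 5 both wrong.
model_a = [True] * 20 + [True] * 10 + [False] * 2 + [False] * 5
model_b = [True] * 20 + [False] * 10 + [True] * 2 + [False] * 5
p_value = mcnemar_exact(model_a, model_b)
print(f"p = {p_value:.4f}")
```

With 10 versus 2 discordant cases the exact p-value is 158/4096, about 0.039, which would count as significant at the 0.05 level used above. The cases on which both models agree do not affect the result.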

Funding: Pialis Family Chair in Education

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12901620/full.md

---
Source: https://tomesphere.com/paper/PMC12901620