# Multicenter Clinical Validation of an Artificial Intelligence Diagnostic Classification Model for Laryngoscopy Images

**Authors:** Claudio Sampieri, Francesco Mora, Giorgio Peretti, Marc Larrosa, Isabel Vilaseca, Francesc X. Avilés‐Jurado, Alessandro Ioppi, Elisa Bellini, Berta Alegre, Laura Ruiz‐Sevilla, Rakesh Srivastava, Athanasios C. Sakellaridis, Andriana Razou, Georgios P. Kotsis, Sara Moccia, Leonardo S. Mattos, Chiara Baldini

PMC · DOI: 10.1002/ohn.70153 · Otolaryngology--Head and Neck Surgery · 2026-02-16

## TL;DR

This study validates an AI model that classifies laryngeal lesions in laryngoscopy images as high risk or low risk, with accuracy noninferior to that of otolaryngologists and superior to that of general practitioners and ChatGPT-4o.

## Contribution

A multicenter clinical validation of an AI diagnostic model for laryngeal lesions, demonstrating generalizability to external cohorts and accuracy noninferior to that of otolaryngologists and expert laryngologists.

## Key findings

- The AI model achieved accuracy/AUC of 0.90/0.89 on the internal dataset, 0.85/0.85 on the external Greek dataset, and 0.88/0.88 on the external Spanish dataset.
- The model's accuracy was statistically noninferior to that of otolaryngologists and expert laryngologists, and superior to that of general practitioners and ChatGPT-4o.
- The model shows potential for use in resource-limited settings and is being tested in a prospective clinical trial.
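
As a rough illustration of what a noninferiority claim on accuracy involves, the sketch below compares two accuracies using a Wald confidence interval for their difference. The summary does not state which statistical test or margin the authors used, so the `margin` value, the unpaired two-proportion interval, and the counts in the usage lines are all assumptions made for illustration, not study data. When the model and the clinicians read the same images, a paired analysis would normally be more appropriate.

```python
import math


def noninferiority_check(correct_model, n_model, correct_ref, n_ref,
                         margin=0.05, z=1.96):
    """Return (difference, lower CI bound, noninferior?) for two accuracies.

    margin: hypothetical noninferiority margin (not taken from the paper).
    z: 1.96 gives a two-sided 95% interval, i.e. a one-sided 97.5% bound.
    """
    p_model = correct_model / n_model
    p_ref = correct_ref / n_ref
    diff = p_model - p_ref
    # Wald standard error for the difference of two independent proportions
    se = math.sqrt(p_model * (1 - p_model) / n_model
                   + p_ref * (1 - p_ref) / n_ref)
    lower = diff - z * se
    # Noninferior if the lower bound stays above the negative margin
    return diff, lower, lower > -margin


# Illustrative counts only (invented for this example)
diff, lower, ok = noninferiority_check(850, 1000, 840, 1000)
print(f"difference={diff:.3f}, lower bound={lower:.3f}, noninferior={ok}")
```

The model is declared noninferior only when the whole interval lies above `-margin`, so a small observed deficit can still pass, while a large one cannot.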

## Abstract

**Objective:** To develop and externally validate a computer‐aided diagnosis (CADx) model using artificial intelligence (AI) for classifying laryngeal lesions from laryngoscopy images into high‐risk (HR) or low‐risk (LR) categories.

**Study Design:** Retrospective multicenter development of a CADx model and external validation on independent cohorts.

**Setting:** Multicenter tertiary referral hospitals (Italy, India, China, Greece, and Spain).

**Methods:** Over 20,000 images derived from laryngoscopic examinations were retrieved. Images were annotated based on histopathology or expert consensus. A deep learning model was trained using an internal dataset and evaluated on 2 external datasets to assess generalizability. The CADx model classifies only images containing visible lesions, discriminating between LR and HR categories. Diagnostic performance was measured using standard metrics, including accuracy, precision, recall, F1‐score, and area under the receiver operating characteristic curve (AUC). Model performance was compared with that of physicians of varying expertise and ChatGPT‐4o.

**Results:** The CADx model achieved similar performance across internal and external datasets in distinguishing HR from LR lesions, with accuracy/AUC of 0.90/0.89 internally, 0.85/0.85 on the Greek dataset, and 0.88/0.88 on the Spanish dataset. The model's accuracy was statistically noninferior to that of otolaryngologists and expert laryngologists, and superior to that of general practitioners and ChatGPT‐4o.

**Conclusion:** This is a large multicenter clinical validation of a CADx model for laryngeal endoscopy, demonstrating generalizability and performance comparable to that of clinicians in discriminating between LR and HR lesions. The model's success supports its potential role in augmenting diagnostic capabilities, especially in resource‐limited settings. A prospective multicenter clinical trial is underway to assess real‐world clinical implementation.
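
The metrics named in the abstract (accuracy, precision, recall, F1‐score, and AUC) can be computed from per‐image labels and model scores along the following lines. This is a minimal sketch and not the authors' evaluation code: the label encoding (HR = 1, LR = 0), the 0.5 decision threshold, the function name `binary_metrics`, and the toy data are assumptions made for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_score: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Compute standard binary metrics; assumes HR = 1 and LR = 0."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # AUC as the probability that a random HR image receives a higher score
    # than a random LR image, with ties counted as one half.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one HR and one LR example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Toy data for illustration only (not from the study)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(binary_metrics(labels, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are commonly reported side by side.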

## Full-text entities

- **Diseases:** laryngeal lesions (MESH:D007818)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13035010/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13035010/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC13035010/full.md

---
Source: https://tomesphere.com/paper/PMC13035010