# Accurate and Scalable Classification of Colonoscopy Neoplasia Using Machine Learning and Natural Language Processing

**Authors:** Brendan Broderick, Jason Greenwood, Douglas Mahoney, Kelli Burger, Sushil Kumar Garg, Michael B. Wallace, Suryakanth R. Gurudu, Derek Ebner, John Kisiel

PMC · DOI: 10.14309/ctg.0000000000000959 · Clinical and Translational Gastroenterology · 2024-12-17

## TL;DR

This study shows a machine learning system can accurately classify colonoscopy findings, helping monitor and improve colon cancer screening quality.

## Contribution

A novel random forest-based NLP system is developed for accurate and scalable classification of colorectal neoplasia from unstructured medical reports.

## Key findings

- On the holdout set the model discriminated all three neoplasia types well: AUC 0.997 for adenomas and 0.99 each for serrated and advanced lesions.
- Validation used an independent holdout set of 337 manually annotated procedures.
- The approach combines NLP and machine learning for explainable and scalable quality monitoring.

## Abstract

Colorectal cancer remains a leading cause of cancer-associated death in the United States, and colonoscopy is the primary screening strategy for prevention. Rates of adenomatous and serrated neoplasia detection are inversely associated with postcolonoscopy colorectal cancer. This crucial quality metric depends on accurate ascertainment of colorectal neoplasia findings from both endoscopy and histopathology records. We aimed to assess the feasibility of a random forest machine learning model to rapidly and accurately categorize colorectal neoplasia from electronic health record data.

A retrospective cohort study compared neoplasia detection rates among individuals undergoing colonoscopy at a large academic institution to develop a rule-based algorithm that categorizes colorectal neoplasia from endoscopy reports and pathology Systematized Nomenclature of Medicine – Clinical Terms codes. This cohort provided a large training set to develop a natural language processing system using a random forest approach to automatically classify unstructured pathology findings into adenoma, serrated, or advanced neoplasms. This system was manually validated on an independent holdout set.
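The rule-based labeling step described above can be illustrated with a minimal sketch. This is not the study's algorithm: the keyword patterns, the negation cues, and the function name `label_report` are all assumptions made for illustration, and the sketch uses free-text keywords rather than coded terms. Lesion size, which commonly contributes to an "advanced" designation, would come from the endoscopy report and is left out here.

```python
import re

# Hypothetical keyword rules (assumptions, not the study's actual rules or codes).
RULES = {
    "adenoma": re.compile(r"\b(tubular|tubulovillous|villous)\s+adenoma\b", re.I),
    "serrated": re.compile(
        r"\b(sessile serrated (lesion|polyp|adenoma)|traditional serrated adenoma)\b",
        re.I,
    ),
    "advanced": re.compile(
        r"\b(high[- ]grade dysplasia|tubulovillous|villous|adenocarcinoma)\b", re.I
    ),
}

# Very simple negation cues; a production system would need a fuller approach.
NEGATION = re.compile(r"\b(no|negative for|without)\b", re.I)


def label_report(text: str) -> dict:
    """Return a boolean flag per category for one pathology report."""
    labels = {name: False for name in RULES}
    # Work clause by clause so a negation only affects its own clause.
    for clause in re.split(r"[.;\n]", text):
        for name, pattern in RULES.items():
            match = pattern.search(clause)
            # Skip the match if a negation cue precedes it in the same clause.
            if match and not NEGATION.search(clause[: match.start()]):
                labels[name] = True
    return labels


if __name__ == "__main__":
    print(label_report("Sessile serrated lesion; negative for high-grade dysplasia."))
```

Labels produced this way could then serve as training targets for a text classifier such as the random forest the abstract describes.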

The training set comprised 35,953 unstructured pathology reports with matched Systematized Nomenclature of Medicine – Clinical Terms codes from 95,188 unstructured colonoscopy reports. The final model was assessed on an independent holdout set of 337 manually annotated procedures, obtaining an area under the receiver operating characteristic curve of 0.997 (confidence interval [CI] 0.994–1), 0.99 (CI 0.98–1), and 0.99 (CI 0.98–0.99) for prediction of adenoma, serrated, and advanced lesions, respectively.
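For readers unfamiliar with the metric, the sketch below shows one way an area under the receiver operating characteristic curve and its confidence interval can be computed from holdout labels and model scores. The percentile bootstrap is an assumption; the summary does not say how the reported intervals were derived. The function names `auc` and `bootstrap_ci` and the toy data are illustrative only.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; AUC is undefined, so redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1, 1, 0]          # toy labels, not study data
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]  # toy scores
    print(auc(y, s), bootstrap_ci(y, s))
```

The pairwise comparison is quadratic in the number of cases, which is fine for a holdout set of a few hundred procedures; a rank-based computation would be preferable at larger scale.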

The random forest-based hybrid natural language processing system for classification of colonoscopy results was both accurate and explainable. NLP combined with effective machine learning algorithms can provide a scalable strategy for colonoscopy quality monitoring.

## Linked entities

- **Diseases:** colorectal cancer (MONDO:0005575)

## Full-text entities

- **Diseases:** adenoma (MESH:D000236), adenoma, serrated, or advanced neoplasms (MESH:D009369), CRC (MESH:D015179)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12922929/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12922929/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC12922929/full.md

---
Source: https://tomesphere.com/paper/PMC12922929