Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset

Shruti Hegde; Mabon Manoj Ninan; Jonathan R. Dillman; Shireen Hayatghaibi; Lynn Babcock; Elanchezhian Somasundaram

arXiv:2505.23030·cs.CL·May 30, 2025

TL;DR

This study evaluates the reliability of various commercial and open-source NLP tools in automatically annotating pediatric chest radiograph reports, revealing significant variability and emphasizing the need for validation before clinical deployment.

Contribution

It compares four commercial clinical NLP systems and two dedicated chest radiograph report labelers on 95,008 pediatric chest radiograph reports, highlighting their strengths and limitations in clinical report annotation.

Findings

01. Significant differences in entity extraction counts across NLP systems.

02. Assertion detection accuracy varied, with SparkNLP achieving the highest at 76%.

03. Commercial NLP tools showed variable performance, underscoring the need for validation.

Abstract

General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section…
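As a rough illustration of what an assertion detection accuracy figure measures, the sketch below scores one system's assertion statuses (positive, negative, uncertain) against a reference labeling. It is a minimal sketch under stated assumptions, not the authors' evaluation pipeline: the function names, the toy labels, and the idea of a single reference label per entity are all hypothetical, and the abstract as shown here does not say how the reference was built.

```python
from collections import Counter

# The three assertion statuses named in the abstract.
STATUSES = ("positive", "negative", "uncertain")


def assertion_accuracy(predicted, reference):
    """Fraction of entities whose predicted status matches the reference.

    Both arguments are equal-length sequences of status strings.
    (Hypothetical helper, not taken from the paper.)
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one labeled entity is required")
    for label in (*predicted, *reference):
        if label not in STATUSES:
            raise ValueError(f"unknown assertion status: {label!r}")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def confusion_counts(predicted, reference):
    """Count (reference, predicted) status pairs to show where errors fall."""
    return Counter(zip(reference, predicted))


# Toy data, invented for illustration only.
reference = ["positive", "negative", "negative", "uncertain"]
predicted = ["positive", "negative", "uncertain", "uncertain"]

accuracy = assertion_accuracy(predicted, reference)
pairs = confusion_counts(predicted, reference)
print(f"accuracy = {accuracy:.2f}")
print(pairs[("negative", "uncertain")], "negative finding(s) labeled uncertain")
```

Looking at the pair counts alongside the overall accuracy helps separate harmless disagreements from clinically relevant ones, such as a negated finding being reported as uncertain.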

Taxonomy

Topics: COVID-19 diagnosis using AI · Artificial Intelligence in Healthcare and Education · Topic Modeling