# External Validation of an Artificial Intelligence Triaging System for Chest X-Rays: A Retrospective Independent Clinical Study

**Authors:** André Coutinho Castilla, Iago de Paiva D’Amorim, Maria Fernanda Barbosa Wanderley, Mateus Aragão Esmeraldo, André Ricca Yoshida, Anthony Moreno Eigier, Márcio Valente Yamada Sawamura

PMC · DOI: 10.3390/diagnostics15222899 · 2025-11-15

## TL;DR

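This study externally validates TRIA, a commercial AI system for chest X-ray triage, on 1045 radiographs from a hospital not used for training. Its balanced performance (weighted ensemble AUROC 0.927) suggests it could help prioritize urgent cases and reduce reporting delays in emergency care.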

## Contribution

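The paper presents a retrospective external validation of TRIA, a commercial AI triage system for chest X-rays (NeuralMed, São Paulo, Brazil), on data from a teaching university hospital that contributed no training images, and compares four ensemble decision strategies for abnormality detection.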

## Key findings

- The general abnormality classifier achieved an AUROC of 0.911, indicating strong performance in distinguishing normal from abnormal chest X-rays.
- The weighted ensemble offered the best balance between sensitivity (0.845) and specificity (0.861), with an accuracy of 0.854 and an AUROC of 0.927.
- Sensitivity-prioritized ensemble strategies reached sensitivity above 0.92 at the cost of specificity below 0.69.
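
The trade-off in the last finding can be illustrated with a toy comparison of two decision rules. This is a minimal sketch, not TRIA's method: the abstract does not describe how the four ensemble strategies work, so the rule names (`any_positive`, `weighted`), the weights, the threshold, and all scores below are hypothetical values chosen only to make the effect visible.

```python
# Toy illustration only: every score, weight, and threshold here is invented.
# Each case: (label, general score, cardiomegaly score, effusion score); label 1 = abnormal.
CASES = [
    (1, 0.9, 0.8, 0.2),
    (1, 0.7, 0.1, 0.6),
    (1, 0.4, 0.6, 0.1),
    (1, 0.8, 0.3, 0.3),
    (0, 0.1, 0.1, 0.1),
    (0, 0.2, 0.6, 0.1),
    (0, 0.3, 0.2, 0.7),
    (0, 0.2, 0.1, 0.2),
]
WEIGHTS = (0.6, 0.2, 0.2)  # hypothetical weights, general classifier dominant
THRESHOLD = 0.5            # hypothetical decision threshold


def any_positive(scores):
    """Sensitivity-prioritized rule: flag if any single model exceeds the threshold."""
    return max(scores) >= THRESHOLD


def weighted(scores):
    """Weighted rule: flag if the weighted average of the scores exceeds the threshold."""
    return sum(w * s for w, s in zip(WEIGHTS, scores)) >= THRESHOLD


def sens_spec(rule):
    """Return (sensitivity, specificity) of a rule on the toy cases."""
    tp = sum(1 for label, *s in CASES if label == 1 and rule(s))
    tn = sum(1 for label, *s in CASES if label == 0 and not rule(s))
    n_pos = sum(1 for label, *_ in CASES if label == 1)
    n_neg = len(CASES) - n_pos
    return tp / n_pos, tn / n_neg


if __name__ == "__main__":
    print("any_positive:", sens_spec(any_positive))
    print("weighted:    ", sens_spec(weighted))
```

On this toy data the any-positive rule catches every abnormal case but flags half of the normal ones, while the weighted rule misses one abnormal case and flags no normal ones. The direction of the trade-off mirrors the reported one; the magnitudes are meaningless.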

## Abstract

**Background:** Chest radiography (CXR) is the most frequently performed radiological exam worldwide, but reporting backlogs, caused by a shortage of radiologists, remain a critical challenge in emergency care. Artificial intelligence (AI) triage systems can help alleviate this challenge by differentiating normal from abnormal studies and prioritizing urgent cases for review. This study aimed to externally validate TRIA, a commercial AI-powered CXR triage algorithm (NeuralMed, São Paulo, Brazil).

**Methods:** TRIA employs a two-stage deep learning approach, comprising an image segmentation module that isolates the thoracic region, followed by a classification model trained to recognize common cardiopulmonary pathologies. We trained the system on 275,399 CXRs from multiple public and private datasets. We performed external validation retrospectively on 1045 CXRs (568 normal and 477 abnormal) from a teaching university hospital that was not used for training. We established ground truth using a large language model (LLM) to extract findings from original radiologist reports. An independent radiologist review of a 300-report subset confirmed the reliability of this method, achieving an accuracy of 0.98 (95% CI 0.978–0.988). We compared four ensemble decision strategies for abnormality detection. Performance metrics included sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUROC) with 95% CI.

**Results:** The general abnormality classifier achieved strong performance (AUROC 0.911). Individual pathology models for cardiomegaly, pneumothorax, and effusion showed excellent results (AUROC of 0.968, 0.955, and 0.935, respectively). The weighted ensemble demonstrated the best balance, with an accuracy of 0.854 (95% CI, 0.831–0.874), a sensitivity of 0.845 (0.810–0.875), a specificity of 0.861 (0.830–0.887), and an AUROC of 0.927 (0.911–0.940). Sensitivity-prioritized methods achieving sensitivity >0.92 produced lower specificity (<0.69). False negatives were mainly subtle or equivocal cases, although many were still flagged as abnormal by the general classifier.

**Conclusions:** TRIA achieved robust and balanced accuracy in distinguishing normal from abnormal CXRs. Integrating this system into clinical workflows has the potential to reduce reporting delays, prioritize urgent cases, and improve patient safety. These findings support its clinical utility and warrant prospective multicenter validation.
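
The weighted-ensemble figures can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in the abstract: the true-positive and true-negative counts are inferred by multiplying the rounded sensitivity and specificity by the reported class sizes (477 abnormal, 568 normal), and the 95% CIs are assumed to be Wilson score intervals. The function name `wilson_interval` is illustrative.

```python
import math

# Class sizes reported in the abstract.
N_ABNORMAL, N_NORMAL = 477, 568

# Assumption: counts inferred from the rounded rates, not reported directly.
TP = round(0.845 * N_ABNORMAL)  # inferred true positives
TN = round(0.861 * N_NORMAL)    # inferred true negatives


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def summarize(name, successes, n):
    """Print a point estimate with its interval, rounded to three decimals."""
    lo, hi = wilson_interval(successes, n)
    print(f"{name}: {successes / n:.3f} ({lo:.3f}, {hi:.3f})")


if __name__ == "__main__":
    summarize("sensitivity", TP, N_ABNORMAL)
    summarize("specificity", TN, N_NORMAL)
    summarize("accuracy", TP + TN, N_ABNORMAL + N_NORMAL)
```

Under these assumptions the inferred counts give an accuracy of 892/1045, and the three intervals round to the values reported for the weighted ensemble. That agreement is consistent with a Wilson interval but does not confirm which method the authors used.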

## Linked entities

- **Diseases:** pneumothorax (MONDO:0002076)

## Full-text entities

- **Diseases:** cardiomegaly (MESH:D006332), pneumothorax (MESH:D011030), effusion (MESH:D000080324)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12651339/full.md

---
Source: https://tomesphere.com/paper/PMC12651339