# TRIAGE: Trustworthy Reporting and Assessment for Clinical Gain and Effectiveness of AI Models

**Authors:** Farzaneh Fazilati, Mohammad Zakaria Rajabi, Nima Alihosseini, Mohaddeseh Esmaeili Farsani, Seyed Hasan Sandid, Shadi Zamani, Mehrshad Alirezaei Farahani, Fateme Biriaei, Fateme Sadeghipour, Mohammad Taha Mirshamsi, Mottahareh Fahami, Hamid Reza Marateb

PMC · DOI: 10.3390/diagnostics16050666 · Diagnostics · 2026-02-25

## TL;DR

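This review proposes TRIAGE, a framework that organizes diagnostic AI testing as an evidence pipeline aligned with real clinical use cases (screening, triage, second reading, and confirmatory testing), covering discrimination, calibration, clinical utility, robustness, fairness, and deployment constraints.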

## Contribution

TRIAGE offers a clinically grounded evaluation framework for diagnostic AI models, with structured metrics and a checklist table that supports reproducible and clinically interpretable reporting.

## Key findings

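- Because diagnostic deployment is threshold-dependent, TRIAGE combines representation curves (ROC, precision–recall, lift, and cumulative gain) with calibration assessment (calibration slope, Brier score) and decision-curve analysis.
- The framework covers multi-class and multi-label tasks using micro, macro, and weighted averaging, plus set-based measures such as Hamming loss, exact match ratio, and Jaccard/IoU.
- TRIAGE addresses robustness, fairness, leakage-resistant validation (patient-grouped splits, stratified and temporal validation, and external validation), and deployment constraints such as latency, throughput, and energy use.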
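
To make the difference between the aggregation methods concrete, here is a minimal sketch in plain Python. It is not code from the paper: the function names, the two-class toy labels, and the choice of F1 as the aggregated metric are all assumptions made for illustration. Macro averaging weights every class equally, micro averaging pools the counts across classes, and weighted averaging (listed in the abstract) weights each class by its support.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} using one-vs-rest counting."""
    counts = {}
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 computed on counts pooled over all classes: dominated by frequent classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def weighted_f1(counts):
    """Per-class F1 weighted by support (tp + fn), i.e. the number of true cases."""
    total = sum(tp + fn for tp, fp, fn in counts.values())
    return sum((tp + fn) * f1(tp, fp, fn) for tp, fp, fn in counts.values()) / total


# Hypothetical imbalanced toy data: 8 normal cases, 2 pneumonia cases,
# with one pneumonia case missed by the model.
y_true = ["normal"] * 8 + ["pneumonia"] * 2
y_pred = ["normal"] * 8 + ["normal", "pneumonia"]
counts = per_class_counts(y_true, y_pred, ["normal", "pneumonia"])

print("macro   :", round(macro_f1(counts), 4))
print("micro   :", round(micro_f1(counts), 4))
print("weighted:", round(weighted_f1(counts), 4))
```

On this toy data the macro average is lower than the micro average, because the missed case halves recall for the rare class while barely moving the pooled counts. That gap is why the choice of aggregation should be reported alongside the score.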

## Abstract

Machine learning (ML), including deep learning, kernel-based classifiers, and ensemble methods, is increasingly used to support clinical diagnosis in medical imaging, biosignal interpretation, and electronic health record (EHR)-based decision support. Despite rapid progress, many diagnostic AI studies still rely on limited retrospective evaluation and single summary measures (e.g., accuracy or AUC), creating a gap between reported model performance and evidence required for safe clinical adoption. This review proposes TRIAGE, a clinically grounded evaluation framework designed to organize diagnostic AI testing as an evidence pipeline aligned with real clinical use cases (screening, triage, second reading, and confirmatory testing). We summarize core discrimination metrics derived from the confusion matrix (sensitivity, specificity, predictive values, likelihood ratios, diagnostic odds ratio, and F-scores) and highlight the importance of prevalence and spectrum effects for interpreting predictive value and clinical workload. We further review evaluation strategies for multi-class and multi-label diagnostic tasks using appropriate aggregation methods (micro, macro, and weighted averaging) and set-based measures such as Hamming loss, exact match ratio, and Jaccard/IoU. Because diagnostic deployment is threshold-dependent, we integrate representation curves (ROC, precision–recall, lift, and cumulative gain) with calibration assessment and clinical utility analysis, including calibration slope, Brier score, and decision-curve analysis. We also address robustness and fairness evaluation, leakage-resistant validation designs (patient-grouped splits, stratified and temporal validation, and external validation), computational constraints relevant to deployment (latency, throughput, and energy use), and statistically sound model comparison with multiplicity control. 
A structured TRIAGE checklist table summarizing the evaluation parameters described in this review is provided in the main text to support reproducible and clinically interpretable reporting.
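
The abstract's points about confusion-matrix metrics, prevalence effects, and decision-curve analysis can be illustrated with a short sketch. This is not the paper's code: the function names and the counts are hypothetical, and the formulas are the standard textbook definitions (Bayes' rule for predictive values, and net benefit as true positives minus threshold-weighted false positives per patient). The sketch assumes no confusion-matrix cell is zero.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Prevalence-independent metrics from a 2x2 confusion matrix.

    Assumes every cell is non-zero; a real implementation would need
    a continuity correction or explicit handling of empty cells.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_pos": sens / (1 - spec),      # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,      # negative likelihood ratio
        "dor": (tp * tn) / (fp * fn),     # diagnostic odds ratio
    }


def predictive_values(sens, spec, prevalence):
    """PPV and NPV at a given prevalence, via Bayes' rule."""
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (
        spec * (1 - prevalence) + (1 - sens) * prevalence
    )
    return ppv, npv


def net_benefit(tp, fp, n, threshold):
    """Net benefit at one threshold probability, as used in decision-curve analysis."""
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical study of 1000 patients with 10% disease prevalence.
tp, fn, fp, tn = 90, 10, 45, 855
m = diagnostic_metrics(tp, fp, fn, tn)

# Same test characteristics, two deployment settings.
ppv_study, npv_study = predictive_values(m["sensitivity"], m["specificity"], 0.10)
ppv_screen, npv_screen = predictive_values(m["sensitivity"], m["specificity"], 0.01)

nb = net_benefit(tp, fp, n=tp + fn + fp + tn, threshold=0.20)

print(m)
print("PPV at 10% prevalence:", round(ppv_study, 3))
print("PPV at  1% prevalence:", round(ppv_screen, 3))
print("Net benefit at threshold 0.20:", round(nb, 5))
```

With sensitivity 0.90 and specificity 0.95 held fixed, the positive predictive value falls from about 0.67 at 10% prevalence to about 0.15 at 1% prevalence. This is the kind of shift in predictive value and clinical workload that the abstract attributes to prevalence effects.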

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12984829/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12984829/full.md

## References

155 references — full list in the complete paper: https://tomesphere.com/paper/PMC12984829/full.md

---
Source: https://tomesphere.com/paper/PMC12984829