TL;DR
The RISED Framework offers a comprehensive pre-deployment evaluation for clinical AI systems, addressing reliability, inclusivity, sensitivity, equity, and deployability to detect risks beyond traditional accuracy metrics.
Contribution
The paper introduces a five-dimension evaluation framework with formal sub-criteria and pre-specified pass/fail thresholds, validated across diverse clinical datasets, to strengthen pre-deployment safety assessment.
Findings
A classifier that meets conventional high-discrimination benchmarks can still fail input-encoding stability and threshold-shift sensitivity checks.
Failing dimensions vary across datasets, indicating context-specific risks.
The framework exposes construct validity issues in fairness assessments.
Abstract
Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks…
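The abstract names a Holm-Bonferroni family-wise error correction but does not show how it is applied. The following is a minimal sketch of the standard step-down procedure, not the authors' implementation. The function name `holm_bonferroni`, the one-test-per-dimension layout, and all p-values are hypothetical and chosen only for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    p_values: dict mapping a test name to its raw p-value.
    Returns a dict mapping each test name to True if its null
    hypothesis is rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    # Sort tests from smallest to largest p-value.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (0-indexed) is compared to alpha / (m - k).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            # Step-down rule: stop at the first failure; all larger
            # p-values are also not rejected.
            break
    return rejected


# Hypothetical p-values, one per dimension (illustrative, not from the paper).
example_p = {
    "reliability": 0.004,
    "inclusivity": 0.030,
    "sensitivity": 0.011,
    "equity": 0.200,
    "deployability": 0.012,
}
decisions = holm_bonferroni(example_p)
for name, is_rejected in decisions.items():
    print(f"{name}: {'rejected' if is_rejected else 'not rejected'}")
```

In this made-up example, the inclusivity test has a raw p-value below 0.05 but is not rejected after correction, because it is compared against 0.05 / 2 = 0.025 at its step. How a rejection maps to a pass or fail verdict depends on how each test's null hypothesis is framed, which the abstract does not specify.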
