Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection
Jovana Kljajic, John M. O'Toole, Robert Hogan, Tamara Skoric

TL;DR
This paper critically evaluates performance metrics for neonatal seizure detection AI, proposing best practices and a multi-rater Turing test to ensure reliable, honest assessment aligned with clinical validation needs.
Contribution
It systematically assesses existing metrics, highlights their limitations, and introduces a comprehensive framework including a multi-rater Turing test for honest AI evaluation in neonatal seizure detection.
Findings
Matthews and Pearson's correlation coefficients reflect performance under class imbalance better than AUC
Consensus strategies are sensitive to the number of raters and to the level of inter-rater agreement
A multi-rater Turing test using Fleiss' kappa best captures expert-level AI performance
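To make these findings concrete, here is a minimal Python sketch of the two statistics they name: the Matthews correlation coefficient for binary per-epoch labels and Fleiss' kappa for several raters. The function names, the binary 0/1 label encoding, and the `swap_in_ai` helper are illustrative assumptions, not the authors' code. In particular, `swap_in_ai` shows only one plausible reading of a multi-rater Turing test (replace each human rater in turn with the AI and recompute agreement); the paper's exact procedure and statistical test are not given in this summary.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fleiss_kappa(ratings):
    """Fleiss' kappa for binary labels.

    ratings: list of rows, one row per epoch, one 0/1 label per rater.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters assigning category j to epoch i
    counts = [[row.count(c) for c in (0, 1)] for row in ratings]
    # Overall proportion of assignments to each category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in (0, 1)]
    # Per-epoch observed agreement
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    # If every label falls in one category, treat agreement as perfect.
    return (p_bar - p_e) / (1 - p_e) if p_e != 1 else 1.0


def swap_in_ai(ratings, ai_labels):
    """Hypothetical helper: kappa after replacing each human rater with the AI."""
    n_raters = len(ratings[0])
    kappas = []
    for k in range(n_raters):
        swapped = [
            list(row[:k]) + [ai] + list(row[k + 1:])
            for row, ai in zip(ratings, ai_labels)
        ]
        kappas.append(fleiss_kappa(swapped))
    return kappas
```

As a usage note: with 95 non-seizure and 5 seizure epochs, a detector that never flags a seizure is 95% accurate, yet `mcc` returns 0 for it, which is the kind of imbalance effect the first finding refers to. Comparing `fleiss_kappa(ratings)` for the human raters alone against the values returned by `swap_in_ai(ratings, ai_labels)` gives a rough sense of whether agreement changes when the AI stands in for an expert.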
Abstract
Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus…
