Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Jovana Kljajic; John M. O'Toole; Robert Hogan; Tamara Skoric

arXiv:2508.04899 · cs.LG · March 6, 2026

TL;DR

This paper critically evaluates performance metrics for neonatal seizure detection AI, proposing best practices and a multi-rater Turing test to ensure reliable, honest assessment aligned with clinical validation needs.

Contribution

It systematically assesses existing metrics, highlights their limitations, and introduces a comprehensive framework including a multi-rater Turing test for honest AI evaluation in neonatal seizure detection.

Findings

01

Matthews and Pearson correlation coefficients reflect performance better than AUC under class imbalance

02

Consensus strategies are sensitive to the number of raters and the level of inter-rater agreement

03

Multi-rater Turing test with Fleiss' κ best captures expert-level AI performance
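
The third finding can be made concrete with a small sketch. The summary above does not spell out the test procedure, so the following is only one plausible reading, stated as an assumption: compute Fleiss' κ among the human experts, then swap the AI in for each expert in turn and see how κ changes. The helper names (`fleiss_kappa`, `replace_rater`) and the toy annotations are hypothetical, and no statistical test or confidence interval is included.

```python
def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for a list of epochs, each a list with one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of raters assigning epoch i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Per-epoch observed agreement
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Overall proportion of assignments to each category
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    if p_e == 1.0:
        return float("nan")  # undefined when every label is identical
    return (p_bar - p_e) / (1.0 - p_e)


def replace_rater(ratings, rater_index, ai_labels):
    """Return a copy of ratings with one rater's column replaced by the AI labels."""
    swapped = []
    for row, ai in zip(ratings, ai_labels):
        new_row = list(row)
        new_row[rater_index] = ai
        swapped.append(new_row)
    return swapped


# Toy data (hypothetical): 4 epochs, 3 experts, 1 = seizure, 0 = non-seizure
experts = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
ai = [1, 0, 1, 0]

kappa_experts = fleiss_kappa(experts)
# Change in kappa when the AI stands in for each expert in turn
deltas = [
    fleiss_kappa(replace_rater(experts, i, ai)) - kappa_experts
    for i in range(len(experts[0]))
]
```

Under this reading, `deltas` near or above zero would suggest the AI agrees with the experts about as well as they agree with each other. A real analysis would also need an uncertainty estimate, for example by resampling, which this sketch omits.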

Abstract

Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus…
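
As a rough illustration of the Matthews correlation coefficient (MCC) named in the abstract, the sketch below computes it from confusion-matrix counts on imbalanced toy labels and checks that, for binary labels and binary predictions, it coincides with Pearson's correlation. The data and function names are hypothetical. The sketch does not reproduce the paper's comparison against the area under the curve, and applying Pearson's correlation to continuous detector outputs, which the paper may do, is not shown.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = seizure)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def pearson(x, y):
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return 0.0 if denom == 0 else cov / denom


# Toy imbalanced data (hypothetical): 20 epochs, only 2 contain seizure
y_true = [1, 1] + [0] * 18
y_pred = [1, 0] + [0] * 17 + [1]   # one hit, one miss, one false alarm
always_negative = [0] * 20         # a detector that never flags a seizure

mcc_detector = mcc(y_true, y_pred)
mcc_trivial = mcc(y_true, always_negative)
```

Here the trivial detector is correct on 18 of 20 epochs yet scores an MCC of zero under the convention used in `mcc`, which is the kind of behaviour that makes correlation-based metrics attractive when seizures are rare.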
