Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge
Jonathan Dan, Amirhossein Shahbazinia, Christodoulos Kechris, David Atienza

TL;DR
This large-scale empirical benchmark assesses 28 seizure detection algorithms on a private EEG dataset, revealing large performance variability across algorithms and highlighting the need for standardized evaluation to improve model generalization.
Contribution
The study provides a comprehensive evaluation of diverse algorithms on a large, private EEG dataset, establishing a rigorous benchmarking platform for seizure detection.
Findings
Top F1 score of 32% with sensitivity 37%, precision 29%
Significant performance variability among algorithms
Discrepancy between peak performance and population stability
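As a quick consistency check on the headline numbers, F1 is the harmonic mean of sensitivity (recall) and precision. The sketch below uses the rounded percentages reported above. The function name `f1_from_sensitivity_precision` is illustrative and does not come from the paper or the challenge code.

```python
def f1_from_sensitivity_precision(sensitivity: float, precision: float) -> float:
    """Harmonic mean of sensitivity (recall) and precision."""
    if sensitivity + precision == 0:
        # Convention assumed here: F1 is 0 when both inputs are 0.
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Rounded values reported for the top-ranked algorithm.
top_f1 = f1_from_sensitivity_precision(sensitivity=0.37, precision=0.29)
print(f"F1 from rounded inputs: {top_f1:.3f}")
```

The rounded inputs give an F1 of about 0.325, in line with the reported 32%. The small difference is what one would expect from rounding sensitivity and precision before combining them.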
Abstract
Reliable automatic seizure detection from long-term electroencephalography (EEG) remains an unsolved challenge, as current models often fail to generalize across patients or clinical settings. Manual EEG review is still the standard of care, highlighting the need for robust models and standardized evaluation. The current literature often reports high efficacy, yet these models frequently fail when deployed to unseen patient populations. To rigorously assess this generalization gap, we conducted a large-scale empirical study evaluating 28 state-of-the-art algorithmic architectures, ranging from classical feature engineering to modern deep learning. These algorithms were collected by organizing a competition. A strictly held-out private dataset of continuous EEG recordings from 65 subjects, totaling 4,360 hours of data, was used to evaluate algorithm performance. Expert…
