# SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models

**Authors:** Alvise Dei Rossi, Matteo Metaldi, Michal Bechny, Irina Filchenko, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Athina Tzovara, Francesca D. Faraci, Luigi Fiorillo

PMC · DOI: 10.1038/s41746-025-02237-2 · NPJ Digital Medicine · 2025-12-16

## TL;DR

SLEEPYLAND is an open-source framework that improves the evaluation and generalization of automatic sleep staging models using diverse data and an ensemble method called SOMNUS.

## Contribution

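SLEEPYLAND is an open-source framework that combines ~220,000 h of in-domain and ~84,000 h of out-of-domain polysomnographic recordings with pre-trained state-of-the-art sleep staging models. It also introduces SOMNUS, a soft-voting ensemble of those models that outperforms the individual models and exceeds prior state-of-the-art results.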

## Key findings

- SOMNUS outperforms individual models in 94.9% of cases across 24 datasets.
- SOMNUS surpasses the best human scorer on multi-annotated datasets (macro-F1 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O).
- Ensemble disagreement metrics predict scorer ambiguity with 82.8% ROC-AUC.
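
As a rough illustration of the last finding, the sketch below computes two simple per-epoch disagreement scores from an ensemble's class probabilities and checks how well such a score separates ambiguous from unambiguous epochs via ROC-AUC. This summary does not say which disagreement metrics the paper uses, so the two measures here (entropy of the averaged distribution and the fraction of models that dissent from the ensemble vote) are assumptions. The function names and the `(n_models, n_epochs, n_classes)` array layout are also hypothetical.

```python
import numpy as np

def entropy_disagreement(probs, eps=1e-12):
    """Shannon entropy of the model-averaged class distribution, per epoch.

    probs: array of shape (n_models, n_epochs, n_classes), rows summing to 1.
    """
    mean_p = np.asarray(probs, dtype=float).mean(axis=0)
    return -(mean_p * np.log(mean_p + eps)).sum(axis=1)

def vote_disagreement(probs):
    """Fraction of models whose top class differs from the ensemble's, per epoch."""
    probs = np.asarray(probs, dtype=float)
    votes = probs.argmax(axis=2)                  # (n_models, n_epochs)
    ensemble = probs.mean(axis=0).argmax(axis=1)  # (n_epochs,)
    return (votes != ensemble).mean(axis=0)

def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5).

    Uses an O(n_pos * n_neg) pairwise comparison, fine for small illustrations.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```

In use, `labels` would mark epochs on which human scorers disagree, and `scores` would be one of the disagreement measures. How ambiguity is defined from the multi-annotated data is left open here.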

## Abstract

Automatic sleep staging with deep learning has advanced considerably, yet clinical adoption remains hindered by limited generalization, model bias, and inconsistent evaluation practices. We present SLEEPYLAND, an open-source framework comprising ~220,000 h of in-domain and ~84,000 h of out-of-domain polysomnographic recordings, spanning diverse ages, disorders, and hardware configurations. We release pre-trained state-of-the-art models, evaluating them across single- and multi-channel EEG/EOG setups. We introduce SOMNUS, an ensemble that integrates models via soft-voting, achieving robust performance across 24 datasets (macro-F1, 68.7–87.2%), outperforming individual models in 94.9% of cases and exceeding prior state-of-the-art. Exploiting the Bern-Sleep-Wake-Registry (N = 6633), we show that while SOMNUS improves generalization, no model architecture consistently minimizes model demographic/clinical bias. On multi-annotated datasets, SOMNUS surpasses the best human scorer (macro-F1, 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O), more closely reproducing consensus. Finally, ensemble disagreement metrics predict scorer ambiguity (ROC-AUC 82.8%), providing reliable proxies for human uncertainty.
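
The soft-voting and macro-F1 terms in the abstract can be made concrete with a minimal sketch. It assumes each model outputs a probability distribution over sleep stages for every scored epoch, and that soft voting is an unweighted average of those distributions. The summary does not say whether SOMNUS weights its members, so equal weights are an assumption. The names `soft_vote` and `macro_f1` are hypothetical, and scoring a class with no true or predicted epochs as 0 is a convention chosen here.

```python
import numpy as np

def soft_vote(probs):
    """Average class probabilities across models and pick the top class per epoch.

    probs: array of shape (n_models, n_epochs, n_classes).
    Returns (predicted_labels, averaged_probabilities).
    """
    mean_p = np.asarray(probs, dtype=float).mean(axis=0)
    return mean_p.argmax(axis=1), mean_p

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores, so rare stages count as much as common ones."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted epochs scores 0.
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one on a given epoch, which is the usual motivation for soft over majority voting.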

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12816009/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12816009/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12816009/full.md

---
Source: https://tomesphere.com/paper/PMC12816009