sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals
Weixuan Yuan, Zengrui Jin, Yichen Wang, Donglin Xie, Ziyi Ye, Chao Zhang, Xuesong Chen

TL;DR
sleep2vec is a foundation model that learns a unified, robust representation of heterogeneous nocturnal biosignals through cross-modal alignment, improving sleep staging and clinical assessments despite sensor heterogeneity and dropout.
Contribution
It introduces sleep2vec, a contrastively pre-trained model that aligns multiple biosignal modalities using metadata-aware objectives, advancing unified modeling of nocturnal biosignals.
Findings
Outperforms strong baselines in sleep staging and clinical outcome tasks.
Remains robust under sensor dropout and with any subset of the available modalities.
Characterizes scaling laws for nocturnal biosignals with respect to modality diversity and model capacity.
Abstract
Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG) devices, bedside monitors and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO$_2$). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present \texttt{sleep2vec}, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. \texttt{sleep2vec} is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a \textit{Demography, Age, Site \& History-aware InfoNCE} objective that incorporates physiological and acquisition metadata (\textit{e.g.}, age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On…
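The metadata-aware weighting of negatives described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The weighting rule used here (up-weighting negatives that share more metadata fields with the anchor, so that site or demographic cues alone cannot separate them), the parameter `alpha`, and the helper names `metadata_match_fraction` and `weighted_info_nce` are all assumptions made for illustration; the abstract does not specify the exact form of the weights.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def metadata_match_fraction(meta):
    """Fraction of matching categorical metadata fields for every pair.

    meta: (N, K) integer-coded fields, e.g. age bucket, sex, site (hypothetical coding).
    Returns an (N, N) matrix with values in [0, 1].
    """
    meta = np.asarray(meta)
    return (meta[:, None, :] == meta[None, :, :]).mean(axis=2)

def weighted_info_nce(z_a, z_b, neg_weights, temperature=0.1):
    """InfoNCE between two modality views with per-pair weights on negatives.

    z_a, z_b: (N, D) embeddings of the same N recordings from two modalities.
    neg_weights: (N, N) positive weights; the diagonal (positives) is forced to 1.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    w = np.array(neg_weights, dtype=float)  # copy, so the caller's array is untouched
    np.fill_diagonal(w, 1.0)
    # Multiplying exp(logit) by w is the same as adding log(w) to the logit.
    logits = logits + np.log(w)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy usage: 6 recordings, 8-dim embeddings, 3 metadata fields (all values made up).
rng = np.random.default_rng(0)
z_a = rng.normal(size=(6, 8))
z_b = z_a + 0.1 * rng.normal(size=(6, 8))
meta = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1], [0, 1, 2], [1, 0, 2]])
alpha = 2.0  # assumed strength of the metadata weighting
weights = 1.0 + alpha * metadata_match_fraction(meta)
loss_plain = weighted_info_nce(z_a, z_b, np.ones((6, 6)))
loss_weighted = weighted_info_nce(z_a, z_b, weights)
```

With all weights equal to 1 the function reduces to standard InfoNCE. Under the assumed rule, negatives that resemble the anchor in metadata contribute more to the denominator, so the weighted loss is never smaller than the unweighted one on the same batch.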
Peer Reviews
Decision·ICLR 2026 Poster
Strengths: The paper addresses an important practical problem in sleep medicine and in mobile or low-burden monitoring: the many possible channel layouts and frequently missing sensors. Showing that one pre-trained model can handle nine different signal types and remain robust when some are absent is a meaningful step toward realistic deployment across devices and centers. The methodological core is coherent. The use of two modality batches, masking, a single backbone and a shared
Weaknesses: Although the paper claims better cross-site generalization through metadata-aware weighting, the current experiments do not fully isolate this effect. It would be more convincing to show a split in which one cohort is entirely held out during pre-training and used only for evaluation, and to show that the gap between the standard InfoNCE and DASH InfoNCE widens in that setting. At present, the evidence comes from aggregate metrics and from the claim that site and demographic similar
- The paper is very easy to follow.
- The paper proposes a new representation learning method for all the channels of a PSG.
- The method is tested on a large corpus comprising more than 30,000 subjects.
- With the gating mechanism, we can see which channels matter most for the classification, which gives the model good interpretability.
- Good t-SNE visualization that gives insight into how the proposed method works.
- Figure 2 introduces the Intra-subject and Inter-subject segments. This is never used in the entire paper. This additional information, in my opinion, is likely to lead to a misunderstanding of the method.
- The motivation is that no model deals with the full channels of PSG. In Table 1, two competitors are proposed for a full channel setting. Does that mean the model can handle all the channels? What is the addition of sleep2vec?
- The competitors presented in Table 1 are never introduced. I
1. Multi-task pretraining objective: Combines reconstruction and contrastive objectives to strengthen inter-modal coordination and representation learning.
2. Cross-modal modeling innovation: The modality reconstruction task effectively addresses missing-modality scenarios.
3. Comprehensive experiments: Covers a wide range of datasets and tasks, demonstrating strong adaptability.
1. Limited originality: The approach lacks novelty, as many prior works have already explored missing-modality and contrastive learning in sleep research, such as CIMSleepNet (NeurIPS 2024), MultiConsSleepNet (IEEE JBHI 2025), and SleepSMC (ICLR 2025).
2. Outdated baselines: The comparison methods are mostly old, missing fair comparisons with the latest relevant works mentioned above.
3. Lack of ablation studies: The paper does not explicitly analyze the independent contributions of each modul
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Non-Invasive Vital Sign Monitoring · ECG Monitoring and Analysis
