Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts
Yiyao Yang

TL;DR
This paper evaluates the robustness of deep learning models in regulatory genomics under various biological and technical distribution shifts, highlighting robustness gaps and proposing methods to improve reliability.
Contribution
It introduces a comprehensive robustness framework combining simulation and real data analysis to assess and enhance model reliability under genomic distribution shifts.
Findings
Models are robust to mild GC content shifts but fail under motif rewiring and noise.
Adding biological priors improves robustness but has limited effect against high noise.
Uncertainty-aware prediction helps recover low-risk predictions under distribution shifts.
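One common reading of the last finding is selective prediction: the model abstains on its most uncertain inputs, and error is measured only on the retained fraction (the coverage). The sketch below is a minimal illustration under that assumption and is not the paper's procedure; the function name `selective_risk`, the per-example error and uncertainty lists, and the toy numbers are all hypothetical.

```python
def selective_risk(errors, uncertainties, coverage):
    """Mean error over the `coverage` fraction of examples with the lowest uncertainty.

    errors:        per-example losses (e.g. squared errors), hypothetical inputs
    uncertainties: per-example uncertainty scores, higher means less confident
    coverage:      fraction of examples to keep, in (0, 1]
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if not errors or len(errors) != len(uncertainties):
        raise ValueError("errors and uncertainties must be non-empty and equal length")
    # Rank examples from most to least confident.
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    # Keep at least one example so the mean is always defined.
    keep = max(1, int(coverage * len(errors)))
    kept = order[:keep]
    return sum(errors[i] for i in kept) / keep


# Toy data: the two high-error examples also carry high uncertainty.
toy_errors = [0.1, 0.9, 0.2, 0.8]
toy_uncertainties = [0.05, 0.7, 0.1, 0.6]
full_risk = selective_risk(toy_errors, toy_uncertainties, coverage=1.0)
half_risk = selective_risk(toy_errors, toy_uncertainties, coverage=0.5)
```

When uncertainty tracks error, as in the toy data, risk falls as coverage shrinks. Under distribution shift that relationship can weaken, which is why risk at fixed coverage is a useful reliability check.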
Abstract
Robust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence prediction tasks but are usually evaluated under i.i.d. assumptions, even though real applications involve cell-type-specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real-data analysis on a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, motif-driven regulatory outputs are generated with cell-type-specific programs, position weight matrix (PWM) perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and CNN,…
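To make two of the simulated shifts concrete, the sketch below samples sequences at a chosen GC content and perturbs a PWM before scoring motif matches. It is an illustrative sketch only: the abstract does not specify the generative model, so every function name, the mixing-based perturbation, the uniform background, and the toy motif are assumptions.

```python
import math
import random

BASES = "ACGT"


def sample_sequence(length, gc_content, rng):
    """Draw a DNA sequence whose expected GC fraction is `gc_content`."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    weights = [at, gc, gc, at]  # order matches BASES: A, C, G, T
    return "".join(rng.choices(BASES, weights=weights, k=length))


def perturb_pwm(pwm, strength, rng):
    """Mix each PWM column with a random distribution (assumed perturbation).

    strength=0 returns the original columns; strength near 1 is mostly noise.
    """
    perturbed = []
    for column in pwm:
        noise = [rng.random() for _ in BASES]
        total = sum(noise)
        perturbed.append(
            [(1.0 - strength) * p + strength * n / total for p, n in zip(column, noise)]
        )
    return perturbed


def best_motif_score(seq, pwm, background=0.25):
    """Highest log-odds score of the PWM over all windows of `seq`."""
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log(pwm[j][BASES.index(seq[start + j])] / background)
            for j in range(width)
        )
        best = max(best, score)
    return best


# Hypothetical 3-bp motif preferring "ACG"; entries are strictly positive.
toy_pwm = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
]
rng = random.Random(0)
shifted_seq = sample_sequence(200, gc_content=0.7, rng=rng)  # GC-shifted input
rewired_pwm = perturb_pwm(toy_pwm, strength=0.3, rng=rng)    # perturbed motif
score_original = best_motif_score(shifted_seq, toy_pwm)
score_rewired = best_motif_score(shifted_seq, rewired_pwm)
```

Comparing the two scores on the same sequence shows how a perturbed motif changes the signal a model would have to track, separately from any change in base composition.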
Taxonomy
Topics: Gene Regulatory Network Analysis · Single-cell and spatial transcriptomics · Genomics and Chromatin Dynamics
