scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
Davide D'Ascenzo, Sebastiano Cultrera di Montesano

TL;DR
scDataset is a PyTorch data loader that enables efficient training of deep learning models on massive single-cell datasets. It combines block sampling with batched fetching, which significantly reduces I/O overhead while maintaining minibatch diversity.
Contribution
It introduces a scalable data loading method that balances efficiency and diversity, enabling practical training on datasets exceeding memory capacity.
Findings
Achieves a speedup of more than 100x over true random sampling on Tahoe-100M, a dataset of 100 million cells.
Maintains model performance comparable to that of true random sampling.
Works directly with large AnnData files without loading entire datasets into memory.
Abstract
Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData…
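The combination of block sampling and batched fetching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the scDataset API: the function names and the parameters `block_size` and `fetch_factor` are hypothetical, and an in-memory NumPy array stands in for an on-disk store. The idea shown is that indices are drawn as shuffled contiguous blocks (few random seeks), several minibatches' worth of blocks are read in one call, and the fetched rows are reshuffled in memory before being split into minibatches.

```python
import numpy as np


def shuffled_blocks(n_cells, block_size, rng):
    """Split [0, n_cells) into contiguous blocks and shuffle the block order."""
    starts = np.arange(0, n_cells, block_size)
    rng.shuffle(starts)  # randomness acts on whole blocks, not single cells
    return [np.arange(s, min(s + block_size, n_cells)) for s in starts]


def quasi_random_batches(data, batch_size, block_size, fetch_factor, seed=0):
    """Yield minibatches from `data` using block sampling plus batched fetching.

    All parameter names are illustrative assumptions, not a library API.
    `data` only needs `.shape[0]` and indexing with a sorted integer array.
    """
    rng = np.random.default_rng(seed)
    blocks = shuffled_blocks(data.shape[0], block_size, rng)

    # Batched fetching: read enough blocks for `fetch_factor` minibatches at once.
    blocks_per_fetch = max(1, (batch_size * fetch_factor) // block_size)

    for i in range(0, len(blocks), blocks_per_fetch):
        # Sort indices so the single read touches the store in increasing order.
        idx = np.sort(np.concatenate(blocks[i:i + blocks_per_fetch]))
        chunk = data[idx]  # one read call for many minibatches

        # Reshuffle in memory so each minibatch mixes rows from several blocks.
        chunk = chunk[rng.permutation(len(chunk))]
        for j in range(0, len(chunk), batch_size):
            yield chunk[j:j + batch_size]


# Toy usage: 1000 "cells" with one feature each, standing in for on-disk data.
data = np.arange(1000).reshape(-1, 1)
batches = list(quasi_random_batches(data, batch_size=32, block_size=8, fetch_factor=4))
```

In this sketch, a larger `block_size` means fewer random seeks per fetch but less diverse minibatches, while a larger `fetch_factor` restores diversity by mixing more blocks in memory at the cost of a larger buffer.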
Taxonomy
Topics: Single-cell and spatial transcriptomics · Cell Image Analysis Techniques · Stochastic Gradient Optimization Techniques
