TL;DR
annbatch is a mini-batch loader native to anndata that enables out-of-core training on disk-backed biological datasets too large for system memory, raising loading throughput by up to an order of magnitude and cutting training time from days to hours.
Contribution
Introduces annbatch, an out-of-core data loader for anndata that makes training on large, diverse biological datasets practical without loading them into memory.
Findings
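Increases data-loading throughput by up to an order of magnitude across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks.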
Reduces training time from days to hours on large datasets.
Maintains full compatibility with the scverse ecosystem.
Abstract
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing…
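To make "out-of-core mini-batch loading" concrete, here is a minimal sketch of the general idea in plain Python. It is not annbatch's API or implementation: the function names (`write_rows`, `iter_minibatches`), the flat float32 file layout, and the parameters `chunk_rows` and `chunks_per_buffer` are all hypothetical, chosen only for illustration. The sketch assumes one common strategy: read contiguous chunks of rows from disk in random order, shuffle them inside a small in-memory buffer, and yield mini-batches, so that only a fraction of the dataset is in memory at any time.

```python
import os
import random
import tempfile
from array import array


def write_rows(path, n_rows, n_cols):
    """Write a toy dense float32 matrix to disk, row-major; row i holds the value i."""
    with open(path, "wb") as f:
        for i in range(n_rows):
            array("f", [float(i)] * n_cols).tofile(f)


def iter_minibatches(path, n_rows, n_cols, batch_size,
                     chunk_rows, chunks_per_buffer, seed=0):
    """Yield mini-batches (lists of rows) without loading the whole file.

    Hypothetical illustration: contiguous chunks are visited in random
    order, a few chunks are pooled in memory, and rows are shuffled
    within that pool before being cut into batches.
    """
    rng = random.Random(seed)
    row_bytes = array("f").itemsize * n_cols
    starts = list(range(0, n_rows, chunk_rows))
    rng.shuffle(starts)  # random chunk order; reads inside a chunk stay sequential
    with open(path, "rb") as f:
        for k in range(0, len(starts), chunks_per_buffer):
            buffer = []
            for start in starts[k:k + chunks_per_buffer]:
                n = min(chunk_rows, n_rows - start)
                f.seek(start * row_bytes)
                block = array("f")
                block.fromfile(f, n * n_cols)  # one contiguous disk read
                for r in range(n):
                    buffer.append(block[r * n_cols:(r + 1) * n_cols].tolist())
            rng.shuffle(buffer)  # mix rows across the pooled chunks
            for b in range(0, len(buffer), batch_size):
                yield buffer[b:b + batch_size]


def demo():
    """Build a 100 x 8 toy file and collect every mini-batch from it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_rows(path, n_rows=100, n_cols=8)
        return list(iter_minibatches(path, 100, 8, batch_size=16,
                                     chunk_rows=10, chunks_per_buffer=4))


if __name__ == "__main__":
    batches = demo()
    print(len(batches), "mini-batches read from disk")
```

The design trade-off in this sketch is between randomness and I/O efficiency: larger chunks mean fewer, more sequential reads, while pooling more chunks per buffer brings the batch composition closer to a full shuffle at the cost of memory.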
