annbatch unlocks terabyte-scale training of biological data in anndata

Ilan Gold; Felix Fischer; Lucas Arnoldt; F. Alexander Wolf; Fabian J. Theis

arXiv:2604.01949 · cs.LG · April 6, 2026

TL;DR

annbatch is a mini-batch loader native to anndata that enables out-of-core training on terabyte-scale, disk-backed biological datasets, raising data-loading throughput by up to an order of magnitude and shortening training from days to hours.

Contribution

The paper introduces annbatch, an out-of-core mini-batch loader native to anndata that trains directly on disk-backed datasets, giving scalable biological AI a practical data-loading infrastructure.

Findings

01 Increases data loading throughput by up to tenfold.

02 Reduces training time from days to hours on large datasets.

03 Maintains full compatibility with the scverse ecosystem.

Abstract

The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing…
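
To make the idea of out-of-core mini-batch loading concrete, here is a minimal sketch of the general technique, not annbatch's API. It assumes a dense toy matrix stored in a flat binary file and read through `numpy.memmap`; the function names `iter_minibatches` and `demo`, the file name `toy.dat`, and all parameter values are hypothetical. The sketch reads contiguous chunks from disk in random order, shuffles rows within each chunk, and yields mini-batches, so only one chunk is held in memory at a time.

```python
import os
import tempfile

import numpy as np


def iter_minibatches(path, shape, dtype, batch_size, chunk_size, seed=0):
    """Yield shuffled mini-batches from a disk-backed array.

    Hypothetical helper for illustration only; not part of annbatch.
    """
    # Memory-map the file: no data is read until a slice is accessed.
    data = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    rng = np.random.default_rng(seed)

    # Visit contiguous chunks in random order (cheap, sequential disk reads).
    chunk_starts = np.arange(0, shape[0], chunk_size)
    rng.shuffle(chunk_starts)

    for start in chunk_starts:
        # Copy one chunk into memory, then shuffle its rows.
        block = np.array(data[start:start + chunk_size])
        block = block[rng.permutation(len(block))]
        for i in range(0, len(block), batch_size):
            yield block[i:i + batch_size]


def demo():
    """Write a small toy matrix to disk and stream it back in mini-batches."""
    shape = (1000, 8)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.dat")
        arr = np.memmap(path, dtype="float32", mode="w+", shape=shape)
        arr[:] = np.arange(shape[0], dtype="float32")[:, None]  # row i holds value i
        arr.flush()
        del arr
        # Consume the generator before the temporary directory is removed.
        return list(iter_minibatches(path, shape, "float32", 64, 256))


if __name__ == "__main__":
    batches = demo()
    print(len(batches), batches[0].shape)
```

Shuffling at the chunk level and then within each chunk trades some randomness for sequential reads, which is a common design choice when the dataset does not fit in memory.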

Code & Models

Repositories

scverse/annbatch (GitHub)