GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

Alex Aizman; Abhishek Gaikwad; Piotr Żelasko

arXiv:2602.22434·cs.DC·February 27, 2026

TL;DR

GetBatch is a new object store API that retrieves an entire training batch in one request instead of thousands of individual GETs. It delivers up to 15x higher throughput for small objects and, in a production training workload, lower batch and per-object tail latency.

Contribution

The paper introduces GetBatch, a first-class batch retrieval API for object stores that replaces independent GET operations with a single deterministic, fault-tolerant streaming execution.

Findings

01

Achieves up to 15x throughput improvement for small objects.

02

Reduces P95 batch retrieval latency by 2x in a production training workload, compared to individual GET requests.

03

Reduces P99 per-object tail latency by 3.7x in the same production workload, compared to individual GET requests.

Abstract

Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.
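
The abstract's claim that per-request overhead can dominate transfer time can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the paper's method or the GetBatch API: the `CostModel` class, the 1 ms overhead, and the 1 GB/s bandwidth are all hypothetical, and the individual GETs are modeled as strictly sequential, which overstates the gap relative to a real client that issues requests concurrently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Toy latency model; every number used with it is an illustrative assumption."""
    per_request_overhead_s: float  # fixed cost paid once per request
    bandwidth_bytes_per_s: float   # sustained transfer rate

    def _transfer_s(self, num_objects: int, object_bytes: int) -> float:
        # Pure data-transfer time, identical for both access patterns.
        return num_objects * object_bytes / self.bandwidth_bytes_per_s

    def individual_gets(self, num_objects: int, object_bytes: int) -> float:
        # One request per object, issued sequentially: overhead scales with N.
        overhead = num_objects * self.per_request_overhead_s
        return overhead + self._transfer_s(num_objects, object_bytes)

    def single_batch(self, num_objects: int, object_bytes: int) -> float:
        # One request for the whole batch: overhead is paid once.
        return self.per_request_overhead_s + self._transfer_s(num_objects, object_bytes)


def speedup(model: CostModel, num_objects: int, object_bytes: int) -> float:
    """Ratio of sequential-GET time to single-batch time under the toy model."""
    return (model.individual_gets(num_objects, object_bytes)
            / model.single_batch(num_objects, object_bytes))


# Hypothetical parameters: 1 ms per-request overhead, 1 GB/s bandwidth.
model = CostModel(per_request_overhead_s=0.001, bandwidth_bytes_per_s=1e9)

small_speedup = speedup(model, 1000, 10_000)      # 1000 objects of 10 KB
large_speedup = speedup(model, 1000, 10_000_000)  # 1000 objects of 10 MB

print(f"small objects: {small_speedup:.1f}x, large objects: {large_speedup:.2f}x")
```

Under these assumed parameters the overhead term dwarfs transfer time for small objects and is nearly negligible for large ones, which matches the direction of the abstract's claim that the largest gains are for small objects. The magnitudes are artifacts of the chosen numbers and should not be read as a prediction of the reported results.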

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Advanced Data Storage Technologies