GetBatch: Distributed Multi-Object Retrieval for ML Data Loading
Alex Aizman, Abhishek Gaikwad, Piotr Żelasko

TL;DR
GetBatch is a new object store API that retrieves an entire training batch in a single request, reducing latency and increasing throughput for ML data loading compared with issuing an individual GET request per object.
Contribution
The paper introduces GetBatch, a first-class batch retrieval API for object stores that replaces many independent GET operations with a single deterministic, fault-tolerant streaming execution.
Findings
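Achieves up to 15x throughput improvement for small objects compared to individual GET requests.
Reduces P95 batch retrieval latency by 2x in a production training workload.
Reduces P99 per-object tail latency by 3.7x in the same production workload.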
Abstract
Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.
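The claim that per-request overhead can dominate transfer time for small objects can be illustrated with a toy cost model. The sketch below is not the GetBatch API or the paper's measurement method. It assumes a fixed overhead per request, serial requests, and a constant bandwidth, and every number in it (object count, object sizes, 1 ms overhead, 1 GB/s bandwidth) is hypothetical. It will not reproduce the 15x figure reported above; it only shows why batching helps small objects far more than large ones.

```python
def individual_get_time(n_objects, object_bytes, overhead_s, bandwidth_bps):
    """Total time when each object is fetched by its own GET, one after another.

    Every request pays the fixed overhead plus its own transfer time.
    """
    return n_objects * (overhead_s + object_bytes / bandwidth_bps)


def batched_get_time(n_objects, object_bytes, overhead_s, bandwidth_bps):
    """Total time when a single batch request streams all objects back.

    The fixed overhead is paid once; the transfer time is unchanged.
    """
    return overhead_s + n_objects * object_bytes / bandwidth_bps


def speedup(n_objects, object_bytes, overhead_s, bandwidth_bps):
    """Ratio of individual-GET time to batched time under this toy model."""
    return (individual_get_time(n_objects, object_bytes, overhead_s, bandwidth_bps)
            / batched_get_time(n_objects, object_bytes, overhead_s, bandwidth_bps))


# Hypothetical parameters, chosen only for illustration.
N_OBJECTS = 1000            # samples needed for one training step
OVERHEAD_S = 0.001          # assumed fixed cost per request: 1 ms
BANDWIDTH_BPS = 1e9         # assumed bandwidth: 1 GB/s
SMALL_OBJECT = 10 * 1024            # 10 KiB
LARGE_OBJECT = 100 * 1024 * 1024    # 100 MiB

small_speedup = speedup(N_OBJECTS, SMALL_OBJECT, OVERHEAD_S, BANDWIDTH_BPS)
large_speedup = speedup(N_OBJECTS, LARGE_OBJECT, OVERHEAD_S, BANDWIDTH_BPS)

if __name__ == "__main__":
    print(f"small objects: batching is {small_speedup:.1f}x faster in this model")
    print(f"large objects: batching is {large_speedup:.3f}x faster in this model")
```

In this model the advantage of batching shrinks toward 1x as objects grow, because transfer time rather than request overhead becomes the dominant term. Real clients also issue GETs concurrently, which the serial assumption here ignores.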
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Advanced Data Storage Technologies
