BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
Ting Sun, Junjie Zhang, Xiao Yan, Songxin Zhang, Zhuoyang Song, Jingyi Xi, Zunyao Mao, Bingyi Jing, Jiaxing Zhang, Zejian Xie

TL;DR
BatchWeave introduces a novel object-store-native data plane for large foundation model training, enabling consistent, failure-isolated, and high-throughput data management tailored for distributed training environments.
Contribution
The paper proposes the Transactional Global Batch (TGB) and Decentralized Adaptive Commit algorithms, which improve consistency, recovery, and throughput in large-scale distributed training data pipelines.
Findings
Outperforms colocated dataloaders in throughput while also providing failure isolation.
Achieves higher ingestion throughput than Apache Kafka.
Provides lower consumer read latency than Kafka.
Abstract
Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present BatchWeave, an object-store-native training data plane for distributed LFM training. BatchWeave uses versioned manifests and conditional object writes to coordinate batch publication, recovery, and lifecycle management. First, it introduces the Transactional Global Batch (TGB), which builds on versioned-manifest ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally…
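The abstract mentions versioned manifests and conditional object writes as the coordination mechanism. As a rough illustration of that general idea only, and not of BatchWeave's actual protocol, the following Python sketch shows how a create-only conditional write can make a batch visible to all ranks at once. Everything here is an assumption made for illustration: the in-memory `InMemoryObjectStore`, the `put_if_absent` method, the `manifests/` key layout, and the `publish_batch` helper are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


class PreconditionFailed(Exception):
    """Raised when a conditional write finds the key already present."""


@dataclass
class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with conditional writes."""
    objects: dict = field(default_factory=dict)

    def put_if_absent(self, key, value):
        # Emulates a create-only conditional write: succeed only if no
        # object exists under this key yet.
        if key in self.objects:
            raise PreconditionFailed(key)
        self.objects[key] = value

    def get(self, key):
        return self.objects.get(key)


def manifest_key(version):
    # Assumed key layout: one immutable manifest object per version.
    return f"manifests/{version:08d}.json"


def latest_version(store):
    # Walk forward until the next manifest version is missing.
    version = 0
    while store.get(manifest_key(version + 1)) is not None:
        version += 1
    return version


def publish_batch(store, batch_id, shard_keys, world_size):
    """Publish one global batch by committing a new manifest version.

    The batch becomes visible only when the manifest write succeeds, so
    readers see either all per-rank shards or none of them.
    """
    if len(shard_keys) != world_size:
        raise ValueError("need exactly one shard per rank")
    while True:
        current = latest_version(store)
        previous = store.get(manifest_key(current)) or {"batches": {}}
        batches = dict(previous["batches"])
        batches[batch_id] = list(shard_keys)
        try:
            store.put_if_absent(manifest_key(current + 1), {"batches": batches})
            return current + 1
        except PreconditionFailed:
            # Another producer committed first; rebuild on the new head.
            continue
```

In this sketch, a producer that loses the race simply rereads the newest manifest and retries, so no central coordinator is needed. A real object store would expose the conditional write through its own API, and the manifest would be serialized rather than stored as a Python dictionary.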
