The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
Frank Sifei Luan, Ron Yifeng Wang, Yile Gu, Ziming Mao, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang

TL;DR
The paper introduces the streaming batch model and Ray Data system, combining batch and streaming to enable efficient, fault-tolerant heterogeneous execution, significantly improving throughput for ML training and inference.
Contribution
It proposes a novel streaming batch model and implements Ray Data, enhancing throughput and resource utilization in heterogeneous ML workloads.
Findings
Ray Data improves throughput on heterogeneous batch inference pipelines by 2.5-12× over traditional systems.
Training throughput for multimodal models such as Stable Diffusion increases by 31%.
The model enables elastic, memory-efficient, and fault-tolerant heterogeneous execution.
Abstract
While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of batch and streaming that enables efficient and fault-tolerant heterogeneous execution. The key idea is to use partitions as the unit of execution to achieve elasticity, but to allow partitions to be dynamically created and streamed between heterogeneous operators for memory-efficient pipelining. We present Ray Data, a streaming batch system that improves throughput on heterogeneous batch inference pipelines by 2.5-12× compared to traditional batch and…
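The key idea in the abstract (partitions as the unit of execution, created on the fly and streamed between operators) can be illustrated with a toy, standard-library-only sketch. This is not Ray Data's implementation or API. The names `make_partitions`, `run_stage`, `run_pipeline`, `preprocess`, and `infer` are hypothetical, and the two stages merely stand in for a CPU-bound and a GPU-bound operator. Bounded queues are one assumed way to cap how many partitions are buffered between stages at once.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the partition stream


def make_partitions(records, partition_size):
    """Lazily group records into partitions (lists) as they arrive."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == partition_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, partition
        yield batch


def run_stage(fn, inbox, outbox):
    """Apply fn to each partition from inbox and forward the result."""
    while True:
        part = inbox.get()
        if part is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(part))


def run_pipeline(records, first_fn, second_fn, partition_size=4, max_buffered=2):
    """Stream partitions through two stages with bounded buffers between them."""
    # Bounded queues block a fast producer, so at most max_buffered
    # partitions wait between any two stages (hypothetical memory cap).
    q_in = queue.Queue(maxsize=max_buffered)
    q_mid = queue.Queue(maxsize=max_buffered)
    q_out = queue.Queue(maxsize=max_buffered)

    def feed():
        for part in make_partitions(records, partition_size):
            q_in.put(part)
        q_in.put(_DONE)

    threads = [
        threading.Thread(target=feed, daemon=True),
        threading.Thread(target=run_stage, args=(first_fn, q_in, q_mid), daemon=True),
        threading.Thread(target=run_stage, args=(second_fn, q_mid, q_out), daemon=True),
    ]
    for t in threads:
        t.start()

    results = []
    while True:  # drain outputs as soon as each partition completes
        part = q_out.get()
        if part is _DONE:
            break
        results.extend(part)
    for t in threads:
        t.join()
    return results


def preprocess(part):
    """Stand-in for a CPU-bound operator (hypothetical)."""
    return [x * 2 for x in part]


def infer(part):
    """Stand-in for a GPU-bound operator (hypothetical)."""
    return [x + 1 for x in part]


if __name__ == "__main__":
    print(run_pipeline(range(10), preprocess, infer))
```

Because each stage here is a single worker, partitions leave the pipeline in the order they entered. A real system would run many workers per operator on different resource types and would also handle failures, which this sketch leaves out.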
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
Methods: Diffusion
