tf.data service: A Case for Disaggregating ML Input Data Processing
Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, and Chandramohan A. Thekkath

TL;DR
The paper introduces tf.data service, a disaggregated data processing system for ML that improves resource utilization and training efficiency by scaling, sharing, and coordinating data preprocessing tasks.
Contribution
It presents a novel disaggregated data processing system that significantly enhances efficiency and scalability for large-scale machine learning training.
Findings
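32x reduction in training time, on average, from horizontally scaling out data processing
26x cost savings, on average, from the same right-sizing of host resources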
2.2x reduction in training time due to coordinated reads
Abstract
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of…
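The fixed-ratio problem described in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions: the function names, the linear cost model, and every number below are hypothetical and are not taken from the paper or from the tf.data service API.

```python
import math


def colocated_throughput(accel_rate, cpu_per_batch, host_cores):
    """Batches/s achieved when preprocessing is limited to the accelerator host's cores.

    accel_rate:    batches/s the accelerator could consume (hypothetical).
    cpu_per_batch: CPU core-seconds needed to preprocess one batch (hypothetical).
    host_cores:    CPU cores available on the accelerator host (hypothetical).
    """
    preprocess_rate = host_cores / cpu_per_batch
    return min(accel_rate, preprocess_rate)


def workers_needed(accel_rate, cpu_per_batch, cores_per_worker):
    """Remote preprocessing workers needed to match the accelerator's ingest rate."""
    return math.ceil(accel_rate * cpu_per_batch / cores_per_worker)


# Hypothetical job: the accelerator can consume 100 batches/s, each batch
# costs 2 core-seconds to preprocess, and the host has only 50 cores.
accel_rate = 100.0
cpu_per_batch = 2.0
host_cores = 50

local = colocated_throughput(accel_rate, cpu_per_batch, host_cores)
utilization = local / accel_rate
extra = workers_needed(accel_rate, cpu_per_batch, cores_per_worker=16)

print(f"colocated throughput: {local} batches/s")
print(f"accelerator utilization: {utilization:.0%}")
print(f"16-core workers needed when disaggregated: {extra}")
```

In this toy model the host can feed only a quarter of what the accelerator could consume, while a job with cheaper preprocessing would leave the same host cores idle. Sizing the worker pool per job, rather than fixing it by the host's hardware ratio, is the motivation the abstract gives for disaggregation.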
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
