tf.data service: A Case for Disaggregating ML Input Data Processing

Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, and Chandramohan A. Thekkath

arXiv:2210.14826 · cs.LG · January 3, 2024 · 1 citation

Open Access

TL;DR

The paper introduces tf.data service, a disaggregated data processing system for ML that improves resource utilization and training efficiency by scaling, sharing, and coordinating data preprocessing tasks.

Contribution

The paper presents tf.data service, a disaggregated input data processing system for ML training that lets the CPU and RAM used for preprocessing scale independently of the accelerator hosts.

Findings

01. 32x reduction in training time

02. 26x cost savings

03. 2.2x reduction in training time due to coordinated reads
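
The coordinated-reads finding can be illustrated with a toy model. The sketch below is not the tf.data service implementation. It assumes synchronous training in which every step waits for the trainer holding the largest batch, and it approximates size bucketing by sorting batches by length before handing them out. The function names and batch lengths are hypothetical.

```python
def step_time(batch_lengths):
    """A synchronous step finishes only when the slowest trainer does.

    Assumption: per-trainer step cost is proportional to its batch length.
    """
    return max(batch_lengths)


def total_time(batch_lengths, num_trainers, coordinated):
    """Sum of step times when batches are dealt out `num_trainers` at a time.

    With coordinated=True, batches are sorted first so that each step serves
    batches of similar size to all trainers (a stand-in for size bucketing).
    """
    if coordinated:
        batch_lengths = sorted(batch_lengths)
    steps = [
        batch_lengths[i:i + num_trainers]
        for i in range(0, len(batch_lengths), num_trainers)
    ]
    return sum(step_time(step) for step in steps)


# Hypothetical workload: short and long batches alternate in arrival order.
lengths = [10, 100, 10, 100, 10, 100, 10, 100]

uncoordinated = total_time(lengths, num_trainers=2, coordinated=False)
coordinated = total_time(lengths, num_trainers=2, coordinated=True)

print(uncoordinated)  # 400: every step pairs a short batch with a long one
print(coordinated)    # 220: two short-only steps, then two long-only steps
```

In this toy workload the uncoordinated schedule pays for a long batch in every step, while the coordinated schedule pays for it only in the steps where all trainers hold long batches. The ratio here depends entirely on the made-up lengths and is not meant to reproduce the 2.2x figure.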

Abstract

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of…
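
The fixed-ratio argument in the abstract can be made concrete with a little arithmetic. The sketch below assumes preprocessing throughput scales linearly with CPU cores and ignores network and memory limits. All names and numbers are hypothetical and are not taken from the paper.

```python
import math


def accelerator_utilization(ingest_rate, per_core_rate, cores):
    """Fraction of time the accelerator has data, given `cores` preprocessing cores.

    Rates are in examples per second; linear scaling with cores is assumed.
    """
    supply = per_core_rate * cores
    return min(1.0, supply / ingest_rate)


def remote_workers_needed(ingest_rate, per_core_rate, local_cores, cores_per_worker):
    """Remote workers needed to cover the shortfall left by the local host cores."""
    cores_required = math.ceil(ingest_rate / per_core_rate)
    shortfall = max(0, cores_required - local_cores)
    return math.ceil(shortfall / cores_per_worker)


HOST_CORES = 96        # hypothetical fixed CPU allotment on the accelerator host
CORES_PER_WORKER = 32  # hypothetical size of one remote preprocessing worker

# Hypothetical input-heavy job: the host alone cannot keep the accelerator fed.
heavy_util = accelerator_utilization(40_000, 100, HOST_CORES)
heavy_workers = remote_workers_needed(40_000, 100, HOST_CORES, CORES_PER_WORKER)

# Hypothetical light job: the same host has far more CPU than it needs.
light_util = accelerator_utilization(2_000, 100, HOST_CORES)
light_workers = remote_workers_needed(2_000, 100, HOST_CORES, CORES_PER_WORKER)

print(heavy_util, heavy_workers)  # accelerator mostly stalled; 10 workers close the gap
print(light_util, light_workers)  # accelerator fully fed; no remote workers needed
```

Under these assumptions the heavy job leaves the accelerator idle about three quarters of the time on a fixed-ratio host, while the light job leaves most host cores idle. Sizing the preprocessing pool per job, as a disaggregated service allows, addresses both cases.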

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
