Characterizing Deep-Learning I/O Workloads in TensorFlow

Steven W. D. Chien; Stefano Markidis; Chaitanya Prasad Sishtla; Luis Santos; Pawel Herman; Sai Narasimhamurthy; Erwin Laure

arXiv:1810.03035·cs.DC·April 10, 2019

TL;DR

This paper analyzes TensorFlow's I/O performance, identifies bottlenecks, and proposes a burst buffer solution to significantly improve checkpointing efficiency and overall training performance.

Contribution

It provides a detailed characterization of TensorFlow's I/O behavior and introduces a burst buffer method to enhance checkpointing performance.
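The burst buffer idea can be illustrated with a minimal sketch, assuming two directories that stand in for a fast, small-capacity tier and a slower, large-capacity tier. The function name `checkpoint_via_burst_buffer` and the `.partial` suffix are hypothetical and chosen for illustration; this is not the paper's implementation, only the general pattern of writing to fast storage first and draining to slow storage in the background.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def checkpoint_via_burst_buffer(data: bytes, name: str,
                                fast_dir: Path, slow_dir: Path) -> threading.Thread:
    """Write a checkpoint to fast storage, then copy it to slow storage asynchronously.

    Hypothetical helper: returns the background thread so the caller can join()
    it when the copy on slow storage must be complete.
    """
    fast_path = fast_dir / name
    fast_path.write_bytes(data)  # blocking write, but only on the fast tier

    def drain() -> None:
        # Copy under a temporary name, then rename, so readers of the slow
        # tier never observe a partially written checkpoint.
        partial = slow_dir / (name + ".partial")
        shutil.copyfile(fast_path, partial)
        partial.replace(slow_dir / name)

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
        copy = checkpoint_via_burst_buffer(b"model weights", "ckpt-0001",
                                           Path(fast), Path(slow))
        # Training would continue here while the copy runs in the background.
        copy.join()
        print(sorted(p.name for p in Path(slow).iterdir()))
```

Under these assumptions the training loop only waits for the write to the fast tier; the cost of the slow tier is paid off the critical path.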

Findings

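01. Increasing the number of threads raises TensorFlow read bandwidth by up to 7.8x.

02. Prefetching overlaps computation with the input pipeline, hiding the cost of I/O in overall training time.

03. Checkpointing through a burst buffer is 2.6x faster than checkpointing directly to slower storage.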
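The overlap described in the prefetching finding can be sketched as a bounded producer/consumer queue. This is a standard-library illustration, not TensorFlow code: `prefetch`, `load_batches`, and `train` are hypothetical names, and the I/O and compute costs are simulated with `time.sleep`.

```python
import queue
import threading
import time


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads up to `buffer_size` ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    end = object()  # sentinel marking the end of the input

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is end:
            return
        yield item


def load_batches(n, io_seconds):
    """Hypothetical input pipeline: each batch costs `io_seconds` of simulated I/O."""
    for i in range(n):
        time.sleep(io_seconds)
        yield i


def train(batches, compute_seconds):
    """Hypothetical training loop: each step costs `compute_seconds` of simulated compute."""
    start = time.perf_counter()
    steps = 0
    for _ in batches:
        time.sleep(compute_seconds)
        steps += 1
    return steps, time.perf_counter() - start


if __name__ == "__main__":
    _, serial = train(load_batches(20, 0.01), 0.01)
    _, overlapped = train(prefetch(load_batches(20, 0.01)), 0.01)
    print(f"serial: {serial:.2f}s  prefetched: {overlapped:.2f}s")
```

When loading a batch takes no longer than a training step, the prefetched run should approach the compute time alone, since the next batch is already buffered when the step finishes. In TensorFlow itself the corresponding mechanism is `tf.data.Dataset.prefetch`.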

Abstract

The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. During training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing…
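A read micro-benchmark of the kind the abstract mentions might look like the following sketch. It is not the authors' benchmark and does not use TensorFlow: it reads many small files with a thread pool and reports aggregate bandwidth for different thread counts. The helper names `make_small_files` and `read_bandwidth`, the file count, and the file size are all assumptions made for illustration.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_small_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` each and return their paths."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"sample_{i:04d}.bin"
        path.write_bytes(b"\0" * size_bytes)
        paths.append(path)
    return paths


def read_bandwidth(paths, num_threads):
    """Read every file using `num_threads` workers; return (bytes_read, bytes_per_second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total_bytes = sum(pool.map(lambda p: len(p.read_bytes()), paths))
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_bytes, total_bytes / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_small_files(tmp, count=256, size_bytes=64 * 1024)
        for threads in (1, 2, 4, 8):
            _, bandwidth = read_bandwidth(files, threads)
            print(f"{threads} threads: {bandwidth / 1e6:.1f} MB/s")
```

Files written just before being read are usually served from the operating system's page cache, so a meaningful measurement of storage bandwidth would need a dataset larger than memory or a cache flush between runs.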
