tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Steven W. D. Chien, Artur Podobas, Ivy B. Peng, Stefano Markidis

TL;DR
This paper introduces tf-Darshan, a system that extends TensorFlow's profiler with system-level I/O performance analysis using Darshan, enabling detailed insights and optimizations for machine learning workloads on HPC systems.
Contribution
We develop tf-Darshan, integrating Darshan with TensorFlow Profiler for system-level I/O analysis without altering Darshan, and demonstrate its effectiveness through case studies.
Findings
Up to 19% increase in POSIX I/O bandwidth through data staging optimization.
tf-Darshan provides detailed system-level I/O profiling during TensorFlow execution.
Potential for runtime I/O profiling to guide future optimizations.
Abstract
Machine Learning applications on HPC systems have been gaining popularity in recent years. Upcoming large-scale systems will offer tremendous parallelism for training through GPUs. Machine Learning is also I/O-intensive, however, and I/O can become a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface that allows instrumentation of TensorFlow operations. The current profiler, though, only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and a tracer, which performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures…
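The abstract contrasts runtime attachment with a system preload (such as setting `LD_PRELOAD` before launch). As a rough illustration of the general mechanism only, the sketch below loads a shared library into an already running Python process through `ctypes`, which calls `dlopen` on POSIX systems. It is not tf-Darshan's implementation: the helper name `attach_library` is hypothetical, and the C math library stands in for an instrumentation library because no Darshan symbols are assumed here.

```python
import ctypes
import ctypes.util


def attach_library(name: str) -> ctypes.CDLL:
    """Load a shared library into the current process at runtime.

    Hypothetical helper for illustration; on POSIX systems ctypes.CDLL
    uses dlopen, so no preload is needed before the process starts.
    """
    path = ctypes.util.find_library(name)
    if path is None:
        raise OSError(f"shared library {name!r} not found")
    return ctypes.CDLL(path)


# Stand-in library: the C math library. A real setup would point at the
# instrumentation library instead (an assumption, not shown here).
libm = attach_library("m")

# Declare the C signature before calling: double cos(double).
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

result = libm.cos(0.0)
print(result)
```

The design point is that the process decides when the library is loaded, so instrumentation can be attached only when profiling is requested instead of for the whole run.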
