FfDL: A Flexible Multi-tenant Deep Learning Platform
K. R. Jayaram, Vinod Muthusamy, Parijat Dube, Vatche Ishakian, Chen Wang, Benjamin Herta, Scott Boag, Diana Arroyo, Asser Tantawi, Archit Verma, Falk Pollok, Rania Khalaf

TL;DR
FfDL is an open-source, flexible, and scalable deep learning platform developed at IBM, designed to efficiently manage large-scale DL training jobs while balancing dependability and performance.
Contribution
This paper introduces FfDL, a novel deep learning platform that integrates dependability, scalability, and flexibility, with comprehensive empirical evaluation and lessons learned from real-world deployment.
Findings
FfDL effectively manages large-scale DL training with minimal overhead.
Empirical results show FfDL's robustness and fault tolerance in real-world scenarios.
Scheduling policies significantly impact performance and resource utilization.
Abstract
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the…
