A Generic Performance Model for Deep Learning in a Distributed Environment
Tulasi Kavarakuntla, Liangxiu Han, Huw Lloyd, Annabel Latham, Anthony Kleerekoper, Samson B. Akintoye

TL;DR
This paper introduces a universal performance model for distributed deep learning applications that accurately predicts execution time by considering various intrinsic and extrinsic factors, applicable across different frameworks without code modifications.
Contribution
It presents a generic, adaptable performance model formulated as an optimization problem, validated on multiple frameworks, enhancing understanding and prediction of distributed deep learning performance.
Findings
Accurately predicts execution time across frameworks
Provides insights into performance and scalability factors
Does not require code instrumentation
Abstract
Performance modelling of a deep learning application is essential to improve and quantify the efficiency of the model framework. However, existing performance models are mostly case-specific, with limited capability for new deep learning frameworks/applications. In this paper, we propose a generic performance model of an application in a distributed environment with a generic expression of the application execution time that considers the influence of both intrinsic factors/operations (e.g. algorithmic parameters/internal operations) and extrinsic scaling factors (e.g. the number of processors, data chunks and batch size). We formulate it as a global optimization problem and solve it using regularization on a cost function and a differential evolution algorithm to find the best-fit values of the constants in the generic expression to match the experimentally determined computation…
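The fitting step described in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the paper's formulation: the three-constant expression in `predict` (fixed cost, work divided across processors, and a per-processor overhead), the L2 penalty and its weight `lam`, the bounds, and the synthetic "measurements". The optimizer is a hand-written basic DE/rand/1/bin loop using only the standard library, not the authors' implementation.

```python
import random

def predict(params, procs, batch):
    """Hypothetical execution-time expression (an assumption, not the paper's):
    fixed cost + work split across processors + per-processor overhead."""
    a, b, c = params
    return a + b * batch / procs + c * procs

def cost(params, samples, lam=1e-3):
    """Mean squared error against measured times plus an assumed L2 penalty."""
    mse = sum((predict(params, p, n) - t) ** 2 for p, n, t in samples) / len(samples)
    return mse + lam * sum(x * x for x in params)

def differential_evolution(objective, bounds, pop_size=30, f=0.7, cr=0.9,
                           generations=300, seed=0):
    """Basic DE/rand/1/bin: build a trial vector from a scaled difference of
    two population members, cross it with the current member, keep the better."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct members, none of them the current one.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    v = pop[r1][d] + f * (pop[r2][d] - pop[r3][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clip to the search bounds
                else:
                    v = pop[i][d]
                trial.append(v)
            s = objective(trial)
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Synthetic "measurements" generated from known constants (illustrative only).
true_params = (2.0, 0.05, 0.3)
samples = [(p, n, predict(true_params, p, n))
           for p in (1, 2, 4, 8) for n in (64, 128, 256)]

bounds = [(0.0, 10.0), (0.0, 1.0), (0.0, 1.0)]
best, best_cost = differential_evolution(lambda x: cost(x, samples), bounds)
print("fitted constants:", best)
print("regularized cost:", best_cost)
```

Because the penalty pulls the constants toward zero, the fitted values will generally sit close to, but not exactly at, the constants used to generate the data. With real timings, `samples` would hold measured (processors, batch size, time) triples instead.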
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Ferroelectric and Negative Capacitance Devices
