ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

Federica Filippini; Danilo Ardagna; Marco Lattuada; Edoardo Amaldi; Michele Ciavotta; Maciek Riedl; Katarzyna Materka; Paweł Skrzypek; Fabrizio Magugliani; Marco Cicala

arXiv:2105.05080·cs.DC·May 12, 2021

TL;DR

ANDREAS is a scheduling system designed to optimize deep learning training workloads on GPU clusters, reducing costs and energy consumption while maintaining performance, validated through simulations and real-world testing.

Contribution

This paper introduces ANDREAS, a scheduling approach that jointly optimizes training runtime and energy use in GPU-accelerated clusters, outperforming first-principle methods in simulation.

Findings

01 Achieves a 30-62% average cost reduction over first-principle methods in simulations

02 Maintains prediction accuracy within 13% in real cluster tests

03 Effectively balances performance and energy efficiency

Abstract

Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource-demanding and benefit greatly from AI accelerators (e.g., GPUs). However, the effective management of GPU-powered clusters comes with great challenges. Among these, efficient scheduling and resource allocation solutions are crucial to maximize performance and minimize data center operational costs. In this paper we propose ANDREAS, an advanced scheduling solution that tackles these problems jointly, aiming at optimizing DL training runtime workloads and their energy consumption in accelerated clusters. Experiments based on simulation demonstrate that we can achieve a cost reduction between 30 and 62% on average with respect to first-principle methods while the validation on a real cluster shows a…
