Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure
Martin Uray, Eduard Hirsch, Gerold Katzinger, Michael Gadermayr

TL;DR
This paper discusses the challenges and solutions in scaling GPU infrastructure for data science, focusing on decision processes, system architecture, and software stack transformation for on-premises clusters.
Contribution
It presents a detailed case study of designing and implementing a scalable GPU cluster infrastructure for data science applications.
Findings
Identified key challenges in scaling GPU infrastructure.
Proposed a systematic decision process for infrastructure scaling.
Demonstrated a successful transformation of software stack for GPU clusters.
Abstract
Enterprises and labs that run computationally expensive data science applications sooner or later face the problem of scaled but unconnected infrastructure. To scale up, they can either hire an IT service provider or have in-house personnel implement a software stack. The first option can be quite expensive if the task is merely to connect several machines. For the second, the data science staff often lack the experience needed to navigate the software jungle. In this technical report, we illustrate the decision process towards an on-premises infrastructure, our implemented system architecture, and the transformation of the software stack towards a scalable GPU cluster system.
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Scientific Computing and Data Management
