Arax: A Runtime Framework for Decoupling Applications from Heterogeneous Accelerators

Manos Pavlidakis, Stelios Mavridis, Antony Chazapis, Giorgos Vasiliadis, and Angelos Bilas

arXiv:2305.01291·eess.SY·May 3, 2023·1 cites

Open Access

TL;DR

Arax is a runtime system that simplifies the use of heterogeneous accelerators: it manages resources dynamically, which enables accelerator sharing and elasticity, and it reduces programming effort, all with minimal overhead.

Contribution

Arax introduces a dynamic runtime framework that decouples applications from hardware accelerators, supports resource sharing and elasticity, and generates stub libraries automatically.

Findings

01. Applications run with about 12% overhead using Arax.

02. Arax improves accelerator sharing, achieving up to 20% faster execution than NVIDIA MPS.

03. Elasticity support reduces total application turnaround time by up to 2x.

Abstract

Today, using multiple heterogeneous accelerators efficiently from applications and high-level frameworks, such as TensorFlow and Caffe, poses significant challenges in three respects: (a) sharing accelerators, (b) allocating available resources elastically during application execution, and (c) reducing the required programming effort. In this paper, we present Arax, a runtime system that decouples applications from heterogeneous accelerators within a server. First, Arax maps application tasks dynamically to available resources, managing all required task state, memory allocations, and task dependencies. As a result, Arax can share accelerators across applications in a server and adjust the resources used by each application as load fluctuates over time. Additionally, Arax offers a simple API and includes Autotalk, a stub generator that automatically generates stub libraries for…
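
The dynamic task-to-resource mapping described in the abstract can be illustrated with a toy sketch. This is not Arax's API or its scheduling policy: the names `ToyRuntime`, `Accelerator`, `Task`, `submit`, and `add_accelerator`, the device labels, and the least-loaded placement rule are all hypothetical, chosen only to show how applications can submit tasks without naming a device while the runtime decides placement and absorbs newly available resources.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str                  # hypothetical device label, e.g. "gpu0"
    queued_work: float = 0.0   # summed estimated cost of tasks placed here


@dataclass
class Task:
    app: str      # application that issued the task
    cost: float   # assumed cost estimate, in arbitrary units


class ToyRuntime:
    """Toy stand-in for a decoupling runtime (not the real Arax design)."""

    def __init__(self, accelerators):
        self.accelerators = list(accelerators)

    def submit(self, task):
        # The application never names a device: the runtime picks the
        # accelerator with the least queued work at submission time.
        target = min(self.accelerators, key=lambda a: a.queued_work)
        target.queued_work += task.cost
        return target.name

    def add_accelerator(self, accelerator):
        # Elasticity in this sketch: a device that becomes available is
        # simply considered for every later placement decision.
        self.accelerators.append(accelerator)


# Two applications, A and B, share the same pool of devices.
rt = ToyRuntime([Accelerator("gpu0"), Accelerator("fpga0")])
placements = [
    rt.submit(Task("A", 4.0)),
    rt.submit(Task("B", 1.0)),
    rt.submit(Task("A", 2.0)),
    rt.submit(Task("B", 2.0)),
]

# A new device joins while the applications are still running.
rt.add_accelerator(Accelerator("gpu1"))
late = rt.submit(Task("A", 1.0))
```

Because placement is decided per task rather than per application, tasks from A and B interleave on the same devices, and the device added later receives work without any change to the submitting code. A real runtime would also have to track task state, memory allocations, and dependencies, which this sketch omits.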

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems