Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Jie Lei; Enrique S. Quintana-Ortí

arXiv:2404.15043·cs.DC·April 24, 2024

TL;DR

This paper presents a parallel GEMM implementation for the AMD Versal ACAP that exploits the multi-level memory hierarchy, the vector units of the AIE tiles, and parallelism across multiple AIE tiles to accelerate deep learning inference.

Contribution

The paper introduces an architecture-specific micro-kernel for mixed precision arithmetic and a parallel GEMM design for the Versal ACAP that spans multiple AI Engines to increase computational throughput.

Findings

01 High parallel scalability with up to 32 AI Engines

02 Efficient use of the Versal ACAP's memory hierarchy

03 Micro-kernel optimized for mixed precision arithmetic

Abstract

This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct experimental profiling, with up to 32 AI Engines, that…
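
As a rough illustration of the CPU-style blocking that the abstract refers to, the following is a minimal Python sketch of a GotoBLAS2-like GEMM loop nest built around a small micro-kernel. The function names, the block sizes (mc, nc, kc, mr, nr), and the packing layout are assumptions chosen for illustration, not the values or code used in the paper. Plain Python integers stand in for the mixed-precision pattern of narrow inputs accumulated at higher precision, and no AIE-specific API is modeled.

```python
def micro_kernel(a_pack, b_pack, c, row0, col0, ir, jr, mr, nr, kb):
    """Update one mr x nr micro-tile of C with a rank-kb product.

    The accumulators live in a small local array, mimicking the way a
    micro-kernel keeps the C micro-tile in registers while streaming
    through the packed A and B buffers.
    """
    acc = [[0] * nr for _ in range(mr)]
    for p in range(kb):
        for i in range(mr):
            a_ip = a_pack[ir + i][p]
            for j in range(nr):
                acc[i][j] += a_ip * b_pack[p][jr + j]
    # Write the accumulated micro-tile back into C.
    for i in range(mr):
        for j in range(nr):
            c[row0 + ir + i][col0 + jr + j] += acc[i][j]


def gemm_blocked(a, b, c, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Compute C += A * B with a five-loop blocked structure.

    a is m x k, b is k x n, c is m x n (lists of lists).
    All block sizes are illustrative assumptions.
    """
    m, k, n = len(a), len(b), len(b[0])
    for jc in range(0, n, nc):                # loop over column panels of B
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):            # loop over the shared dimension
            kb = min(kc, k - pc)
            # Pack a kb x nb panel of B into a contiguous buffer.
            b_pack = [row[jc:jc + nb] for row in b[pc:pc + kb]]
            for ic in range(0, m, mc):        # loop over row blocks of A
                mb = min(mc, m - ic)
                # Pack an mb x kb block of A into a contiguous buffer.
                a_pack = [row[pc:pc + kb] for row in a[ic:ic + mb]]
                for jr in range(0, nb, nr):       # micro-tile columns
                    for ir in range(0, mb, mr):   # micro-tile rows
                        micro_kernel(
                            a_pack, b_pack, c, ic, jc, ir, jr,
                            min(mr, mb - ir), min(nr, nb - jr), kb,
                        )
    return c
```

In a design of the kind the abstract describes, each packed buffer would be sized to fit a particular level of the memory hierarchy and independent micro-tiles could be distributed over several compute units; this sketch simply runs them one after another.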

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Brain Tumor Detection and Classification · Neural Networks and Applications