A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine
M. Sapkas, A. Triossi, M. Zanetti

TL;DR
This paper presents a GRU inference implementation on the AMD Xilinx Versal AI Engine (AIE) that targets low latency and computational efficiency through a custom workload distribution across the AIE vector processors and a hybrid AIE-PL design.
Contribution
It introduces a workload distribution framework and a hybrid AIE-PL architecture for efficient GRU inference on the Versal AI Engine.
Findings
Achieved reduced latency in GRU inference.
Optimized workload distribution across AIE vector processors.
Demonstrated improved computational efficiency with hybrid design.
Abstract
This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency-constrained applications. We present a custom workload distribution framework across the AIE's vector processors and propose a hybrid AIE-Programmable Logic (PL) design to optimize computational efficiency. Our approach explores parallelization over the rows of the matrices, using as many of the AIE vector processors as possible so that all elements of the resulting vector are computed at the same time, as an alternative to cascade stream pipelining.
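The row-wise parallelization described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' AIE code: the tile count, the helper names (`split_rows`, `tile_matvec`, `row_parallel_matvec`, `gru_step`), the parameter dictionary layout, and the particular GRU gate convention are all assumptions chosen for illustration. Each hypothetical "tile" owns a contiguous block of matrix rows and computes only the matching elements of the output vector, so no partial sums need to be passed between tiles.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def split_rows(n_rows, n_tiles):
    """Split row indices 0..n_rows-1 into n_tiles contiguous, near-equal chunks."""
    base, extra = divmod(n_rows, n_tiles)
    chunks, start = [], 0
    for t in range(n_tiles):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def tile_matvec(W, x, rows):
    """Work of one hypothetical tile: full dot products for its own rows only."""
    return [sum(w * v for w, v in zip(W[i], x)) for i in rows]

def row_parallel_matvec(W, x, n_tiles):
    """Matrix-vector product assembled from independent per-tile row blocks.

    The loop runs sequentially here; on real hardware each iteration would
    map to a separate processor running at the same time (assumption).
    """
    out = []
    for rows in split_rows(len(W), n_tiles):
        out.extend(tile_matvec(W, x, rows))
    return out

def gru_step(x, h, p, n_tiles):
    """One GRU step using one common gate convention (assumed, not from the paper)."""
    mv = lambda W, v: row_parallel_matvec(W, v, n_tiles)
    z = [sigmoid(a + b + c) for a, b, c in zip(mv(p["Wz"], x), mv(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(mv(p["Wr"], x), mv(p["Ur"], h), p["br"])]
    rh = [ri * hi for ri, hi in zip(r, h)]  # reset gate applied to previous state
    n = [math.tanh(a + b + c) for a, b, c in zip(mv(p["Wn"], x), mv(p["Un"], rh), p["bn"])]
    # Blend the candidate state with the previous hidden state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
```

Because every output element depends on exactly one matrix row, the result does not depend on how many tiles the rows are spread over. By contrast, a cascade-style scheme would split each dot product across processors and forward partial sums from one to the next.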
Taxonomy
Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Advanced Memory and Neural Computing
