A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine

M. Sapkas; A. Triossi; M. Zanetti

arXiv:2511.15626 · cs.PF · February 3, 2026

Open Access

TL;DR

This paper demonstrates a specialized implementation of GRU inference on AMD Xilinx Versal AI Engine, optimizing latency and computational efficiency through custom workload distribution and hybrid hardware design.

Contribution

It introduces a novel workload distribution framework and hybrid AIE-PL architecture for efficient GRU inference on Versal AI Engine.

Findings

01 Achieved reduced latency in GRU inference.

02 Optimized workload distribution across AIE vector processors.

03 Demonstrated improved computational efficiency with hybrid design.

Abstract

This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency-constrained applications. We present a custom framework for distributing the workload across the AIE's vector processors and propose a hybrid AIE-Programmable Logic (PL) design to optimize computational efficiency. Our approach parallelizes over the rows of the weight matrices: the rows are spread across many of the AIE vector processors, so that all elements of the resulting vector are effectively computed at the same time. This is an alternative to cascade stream pipelining.
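The row-wise parallelization described in the abstract can be illustrated with a minimal, hardware-agnostic sketch. It is not the authors' AIE kernel code: the helper names (`split_rows`, `tile_matvec`, `row_parallel_matvec`, `gru_step`), the contiguous row assignment, and the particular GRU gate formulation are all assumptions made for illustration, and the "tiles" are simulated sequentially rather than run on separate vector processors.

```python
import math

def split_rows(n_rows, n_tiles):
    """Assign contiguous row ranges to tiles as evenly as possible (assumed policy)."""
    base, extra = divmod(n_rows, n_tiles)
    ranges, start = [], 0
    for t in range(n_tiles):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def tile_matvec(matrix, vec, row_range):
    """Work of one hypothetical tile: dot products for its own rows only."""
    lo, hi = row_range
    return [sum(w * v for w, v in zip(matrix[i], vec)) for i in range(lo, hi)]

def row_parallel_matvec(matrix, vec, n_tiles):
    """Concatenate the per-tile slices of the output vector.

    The tiles are independent, so on real hardware they could run concurrently;
    here they are simply evaluated one after another.
    """
    out = []
    for row_range in split_rows(len(matrix), n_tiles):
        out.extend(tile_matvec(matrix, vec, row_range))
    return out

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h, params, n_tiles):
    """One GRU step using a common textbook formulation (an assumption here):

    z  = sigmoid(Wz x + Uz h + bz)
    r  = sigmoid(Wr x + Ur h + br)
    c  = tanh(Wh x + Uh (r * h) + bh)
    h' = (1 - z) * h + z * c
    """
    def gate(name, hidden, act):
        wx = row_parallel_matvec(params["W" + name], x, n_tiles)
        uh = row_parallel_matvec(params["U" + name], hidden, n_tiles)
        return [act(a + b + c) for a, b, c in zip(wx, uh, params["b" + name])]

    z = gate("z", h, sigmoid)
    r = gate("r", h, sigmoid)
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = gate("h", rh, math.tanh)
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
```

Because each output element depends only on one matrix row and the shared input vector, the tiles need no communication with each other during the product. A cascade-style pipeline, by contrast, would pass partial sums from one processor to the next.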

Taxonomy

Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Advanced Memory and Neural Computing