TL;DR
Tempus is a resource-invariant GEMM streaming framework for AMD Versal AI Edge SoCs. It handles larger matrices by scaling over time rather than across more hardware, targeting large language model inference on resource-limited edge devices.
Contribution
The paper introduces a temporal GEMM approach that scales with matrix size without expanding hardware, and reports that it outperforms spatial scaling methods on resource-limited edge devices.
Findings
Achieves 607 GOPS at 10.677 W on-chip power.
211.2x higher prominence factor than the spatial state-of-the-art framework (ARIES).
Uses 0.00% of URAM and DSP resources, with 22.0x core frugality.
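The throughput and power figures above imply an energy-efficiency number, which the short calculation below derives. This is plain arithmetic on the two reported values; the function name is hypothetical, and the paper itself may report efficiency against a different power baseline.

```python
def gops_per_watt(gops: float, watts: float) -> float:
    """Return throughput per watt for a reported GOPS and power figure."""
    return gops / watts


# Figures taken from the Findings list: 607 GOPS at 10.677 W on-chip power.
efficiency = gops_per_watt(607.0, 10.677)
print(f"{efficiency:.2f} GOPS/W")  # prints "56.85 GOPS/W"
```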
Abstract
Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores -- an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs…
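The contrast between spatial and temporal scaling can be illustrated with a generic tiled GEMM in which one fixed-size kernel is reused over successive time steps. This is a minimal sketch of the general idea only, not the Tempus design, which the truncated abstract does not describe. The function name, the `tile` parameter, and the step counter are assumptions made for illustration.

```python
def temporal_gemm(a, b, tile=2):
    """Multiply a (M x K) by b (K x N) with one fixed tile-sized kernel.

    Illustrative assumption: the compute footprint is a single tile x tile
    kernel. Larger matrices cost more kernel invocations (time), not more
    kernels (hardware).
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    steps = 0  # number of times the single kernel is invoked
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One kernel invocation: accumulate a tile-sized partial product.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
                steps += 1
    return c, steps


result, steps = temporal_gemm([[1, 2, 3], [4, 5, 6]],
                              [[7, 8], [9, 10], [11, 12]], tile=2)
print(result, steps)
```

In this sketch, doubling the matrix dimensions increases `steps` while the kernel size stays fixed. A spatially scaled design would instead add more kernels running in parallel.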
