SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang,, Sameh Gobriel, J. Pablo Mu\~noz, Vui Seng Chua, Nilesh Jain, Mohamed S., Abdelfattah

TL;DR
This paper presents SparAMX, a method leveraging AMX support and unstructured sparsity on Intel CPUs to significantly accelerate large language model token generation, reducing latency and enabling more efficient AI inference.
Contribution
It introduces novel sparse kernels and techniques that accelerate LLM inference on CPUs, combining AMX and unstructured sparsity for the first time in attention computation.
Findings
1.42x reduction in end-to-end latency with sparse linear layers.
1.14x speedup in attention computation without accuracy loss.
Open-source implementation for accelerating PyTorch models.
Abstract
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Storage Technologies
MethodsSoftmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training
