SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

Ahmed F. AbouElhamayed; Jordan Dotzel; Yash Akhauri; Chi-Chih Chang; Sameh Gobriel; J. Pablo Muñoz; Vui Seng Chua; Nilesh Jain; Mohamed S. Abdelfattah

arXiv:2502.12444·cs.LG·February 19, 2025

TL;DR

This paper presents SparAMX, which combines Advanced Matrix Extensions (AMX) support with unstructured sparsity on Intel CPUs to accelerate large language model token generation, reporting a 1.42x reduction in end-to-end latency over the current PyTorch implementation.

Contribution

The paper introduces sparse kernels that accelerate LLM inference on CPUs by combining AMX with unstructured sparsity, and it applies unstructured sparsity to attention computation, which it presents as a first.

Findings

01. 1.42x reduction in end-to-end latency over the current PyTorch implementation, from applying the sparse kernels in linear layers.

02. 1.14x speedup in attention computation without accuracy loss.

03. Open-source implementation for accelerating PyTorch models.

Abstract

Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically…
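The abstract ties unstructured sparsity to the memory-bound decoding stage: if fewer weight bytes are read per token, less time is spent waiting on memory. A minimal Python sketch of that idea follows. It is not the SparAMX kernel; the bitmask-plus-values layout, the bf16 weight size (2 bytes), and the names `compress`, `sparse_matvec`, and `weight_bytes` are all assumptions chosen for illustration.

```python
# Illustrative sketch only. The storage layout and every name below are
# assumptions for illustration, not the SparAMX implementation.

def compress(weights):
    """Return (mask, values): one bool per weight, plus the nonzero
    weights in row-major order."""
    mask = [[w != 0.0 for w in row] for row in weights]
    values = [w for row in weights for w in row if w != 0.0]
    return mask, values


def sparse_matvec(mask, values, x):
    """Compute y = W @ x while reading only the stored nonzero weights."""
    y = []
    k = 0  # index of the next unread entry in `values`
    for mask_row in mask:
        acc = 0.0
        for keep, x_j in zip(mask_row, x):
            if keep:
                acc += values[k] * x_j
                k += 1
        y.append(acc)
    return y


def weight_bytes(rows, cols, nonzeros, bytes_per_weight=2):
    """Weight bytes read per token: dense layout vs. bitmask + nonzeros.
    bytes_per_weight=2 assumes bf16 weights."""
    dense = rows * cols * bytes_per_weight
    mask_bytes = (rows * cols + 7) // 8  # one bit per weight, rounded up
    compressed = mask_bytes + nonzeros * bytes_per_weight
    return dense, compressed


# A 3x4 weight matrix with 4 nonzeros out of 12 entries.
W = [[0.0, 1.5, 0.0, 0.0],
     [2.0, 0.0, 0.0, -1.0],
     [0.0, 0.0, 0.0, 0.5]]
x = [1.0, 2.0, 3.0, 4.0]

mask, values = compress(W)
y = sparse_matvec(mask, values, x)
dense_bytes, compressed_bytes = weight_bytes(3, 4, len(values))
print(y, dense_bytes, compressed_bytes)
```

In this toy layout the bitmask costs one bit per weight regardless of sparsity, so the saving in bytes read grows with the fraction of zeros. A real kernel would also have to expand the compressed weights into the dense tiles that a matrix unit consumes, a step this sketch leaves out.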

Code & Models

Repositories

intellabs/hardware-aware-automated-machine-learning (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training