Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs
Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

TL;DR
This paper develops optimized 1-bit and 2-bit microkernels for CPUs and Intel GPUs, speeding up ultra-low-bit LLM inference on AI PCs and Intel GPUs and enabling resource-efficient deployment.
Contribution
It introduces novel microkernels for ultra-low-bit LLM inference on CPUs and GPUs, achieving state-of-the-art performance and end-to-end speedups over existing runtimes.
Findings
2.2x faster than bitnet.cpp for 2-bit models
Up to 7x speedup over 16-bit inference
4x-8x reduction in GEMM time on Xe GPUs
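One way to read these figures is a back-of-the-envelope sketch, which is not the paper's own analysis: if token generation is limited by streaming weights from memory, the ratio of weight bytes bounds the achievable speedup. The parameter count and the helper names `weight_bytes` and `ideal_speedup` below are hypothetical, chosen only for illustration.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


def ideal_speedup(baseline_bits: float, low_bits: float) -> float:
    """Upper bound on speedup, ASSUMING runtime is dominated by reading weights."""
    return baseline_bits / low_bits


NUM_PARAMS = 2_000_000_000  # hypothetical model size, not taken from the paper

for bits in (16, 2, 1):
    gib = weight_bytes(NUM_PARAMS, bits) / 2**30
    bound = ideal_speedup(16, bits)
    print(f"{bits:>2}-bit weights: {gib:6.2f} GiB, ideal speedup vs 16-bit: {bound:.0f}x")
```

Under this memory-bound assumption, 2-bit weights allow at most an 8x gain over 16-bit weights, so the reported speedup of up to 7x sits close to that ceiling.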
Abstract
The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end…
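To make the idea of a 2-bit kernel concrete, here is a minimal pure-Python sketch of packing ternary weights into 2-bit codes and computing a dot product directly from the packed bytes. It is an illustration under stated assumptions, not the paper's microkernel: the encoding (code = weight + 1, four codes per byte, lowest bits first) and the function names `pack_2bit` and `dot_2bit` are hypothetical, and a real kernel would use vectorized instructions rather than scalar loops.

```python
def pack_2bit(weights):
    """Pack ternary weights (-1, 0, 1) into bytes, four 2-bit codes per byte.

    Assumed encoding (illustrative only): code = weight + 1, lowest bits first.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"weight {w} is not ternary")
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def dot_2bit(packed, activations):
    """Dot product of packed ternary weights with a list of activations."""
    if len(activations) != 4 * len(packed):
        raise ValueError("activation count must match the number of packed weights")
    acc = 0
    for i, byte in enumerate(packed):
        for j in range(4):
            code = (byte >> (2 * j)) & 0b11  # extract one 2-bit code
            acc += (code - 1) * activations[4 * i + j]  # decode back to -1/0/1
    return acc


weights = [1, -1, 0, 1, 0, 0, -1, 1]
activations = [3, 5, 7, 2, 4, 6, 1, 8]
packed = pack_2bit(weights)
print(len(packed), dot_2bit(packed, activations))
```

Eight weights occupy two bytes here, compared with sixteen bytes at 16-bit precision, which is the storage saving an ultra-low-bit kernel exploits.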
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
