Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs

Evangelos Georganas; Dhiraj Kalamkar; Alexander Heinecke

arXiv:2508.06753·cs.AI·January 27, 2026

TL;DR

This paper develops optimized 1-bit and 2-bit microkernels for CPUs and Intel GPUs, making ultra-low-bit LLM inference faster and more resource-efficient on AI PCs and Intel GPUs.

Contribution

The paper introduces new microkernels for ultra-low-bit LLM inference on CPUs and GPUs, reporting state-of-the-art performance and end-to-end speedups over existing runtimes.

Findings

01. 2.2x faster than bitnet.cpp for 2-bit models

02. Up to 7x speedup over 16-bit inference

03. 4x-8x reduction in GEMM time on Xe GPUs

Abstract

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end…
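
To make the idea of a 2-bit weight format concrete, the following is a minimal sketch in plain Python of packing ternary weights (-1, 0, +1) into 2-bit codes, four per byte, and computing a dot product directly from the packed bytes. The encoding (code = weight + 1), the byte layout, and the function names `pack_2bit` and `dot_2bit` are assumptions made for illustration. They are not the paper's microkernel design, which targets vectorized CPU and GPU instructions rather than scalar loops.

```python
# Illustrative sketch only: the encoding and layout are assumptions,
# not the kernel design described in the paper.

def pack_2bit(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError("weights must be ternary")
            # Map -1, 0, +1 to the 2-bit codes 0, 1, 2 and place them
            # in the byte starting from the least significant bits.
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def dot_2bit(packed, activations):
    """Dot product of packed ternary weights with a list of activations."""
    total = 0
    for i, byte in enumerate(packed):
        for j in range(4):
            code = (byte >> (2 * j)) & 0b11   # extract one 2-bit code
            total += (code - 1) * activations[4 * i + j]
    return total


weights = [1, -1, 0, 1, 0, 0, -1, 1]
activations = [3, 5, 7, 2, 4, 6, 8, 1]
packed = pack_2bit(weights)
result = dot_2bit(packed, activations)
reference = sum(w * a for w, a in zip(weights, activations))
```

Here eight weights occupy two bytes, compared with sixteen bytes at 16-bit precision, and `result` matches the unpacked `reference`. An optimized kernel would unpack many codes at once with vector instructions, but the storage arithmetic is the same.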

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy