Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs
Dibakar Gope, David Mansell, Danny Loh, Ian Bratt

TL;DR
This paper introduces highly optimized kernels and a groupwise non-uniform quantization method for LLM inference on Arm CPUs. The kernels reduce dequantization overhead, and the quantization method matches weight distributions more closely, which together improve throughput and latency.
Contribution
The work presents new optimized kernels and a groupwise non-uniform quantization approach tailored for Arm CPUs, enhancing LLM inference efficiency and accuracy.
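To make the idea of groupwise non-uniform quantization concrete, the following is a minimal sketch in plain Python. It is not the paper's method or its kernels: the codebook values, the 15-level size, and the helper names (`quantize_group`, `dequantize_group`, `mean_squared_error`) are all assumptions chosen for illustration. Each group of weights stores one scale plus small indices into a table of levels, and the levels are spaced more densely near zero, where most weights in a typical group lie.

```python
# Illustrative groupwise non-uniform (codebook) quantization.
# The codebook values are hypothetical, chosen only to be denser near zero.
CODEBOOK = [-1.0, -0.7, -0.5, -0.35, -0.25, -0.15, -0.08, 0.0,
            0.08, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0]  # 15 levels fit in 4 bits

# Evenly spaced 15-level grid, used only as a point of comparison.
UNIFORM = [i / 7.0 for i in range(-7, 8)]


def quantize_group(chunk, levels):
    """Return (scale, indices): each weight maps to its nearest scaled level."""
    max_abs = max(abs(w) for w in chunk)
    scale = max_abs if max_abs > 0 else 1.0
    indices = [
        min(range(len(levels)), key=lambda k: abs(w / scale - levels[k]))
        for w in chunk
    ]
    return scale, indices


def dequantize_group(scale, indices, levels):
    """Table lookup followed by one multiply by the group scale."""
    return [scale * levels[k] for k in indices]


def mean_squared_error(chunk, levels):
    """Reconstruction error of one group under the given set of levels."""
    scale, indices = quantize_group(chunk, levels)
    restored = dequantize_group(scale, indices, levels)
    return sum((a - b) ** 2 for a, b in zip(chunk, restored)) / len(chunk)


if __name__ == "__main__":
    # A group with many small weights and a single large outlier.
    group = [0.01 * i for i in range(-15, 16)] + [1.0]
    print("non-uniform MSE:", mean_squared_error(group, CODEBOOK))
    print("uniform MSE:    ", mean_squared_error(group, UNIFORM))
```

On a group dominated by small weights with one outlier, the non-uniform levels reconstruct the small weights more closely than an evenly spaced grid of the same size, which is the sense in which a codebook can "better match" the weight distribution.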
Findings
3-3.2x faster prompt processing
2x faster autoregressive decoding
Better throughput with ultra-low-precision quantization
Abstract
Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly…
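The dequantization overhead described in the abstract can be illustrated with a small sketch of a conventional group quantization format. This is an assumed, simplified format (symmetric 4-bit integers, one floating-point scale per group of 32 weights), not the layout or kernels used in the paper, and the function names are hypothetical. The counters separate multiply-accumulates on weights from the extra per-group scaling steps. A real kernel would also spend instructions unpacking 4-bit values from bytes, which this sketch omits.

```python
# Illustrative symmetric 4-bit group quantization (assumed format).
GROUP_SIZE = 32  # assumed group size; real formats vary


def quantize_groups(weights, group_size=GROUP_SIZE):
    """Split weights into groups; keep one float scale plus 4-bit integers per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # map into [-7, 7]
        q = [max(-8, min(7, round(w / scale))) for w in chunk]
        groups.append((scale, q))
    return groups


def dot_dequantized(groups, activations):
    """Dot product that applies each group's scale on the fly.

    Returns the result, the number of weight multiply-accumulates, and the
    number of extra per-group scaling operations (the overhead).
    """
    total, macs, scale_ops, offset = 0.0, 0, 0, 0
    for scale, q in groups:
        partial = 0.0
        for i, qi in enumerate(q):
            partial += qi * activations[offset + i]  # the "real work"
            macs += 1
        total += scale * partial  # per-group rescaling, not a weight multiply
        scale_ops += 1
        offset += len(q)
    return total, macs, scale_ops


if __name__ == "__main__":
    weights = [((i * 37) % 17 - 8) / 10 for i in range(64)]
    activations = [1.0] * 64
    result, macs, scale_ops = dot_dequantized(quantize_groups(weights), activations)
    print(result, macs, scale_ops)
```

Smaller groups track the weights more accurately but raise the ratio of scaling and unpacking work to useful multiplies, which is the trade-off the abstract points to.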
Taxonomy
Topics: Advancements in Photolithography Techniques · Magnetic confinement fusion research · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · LLaMA
