Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

Dibakar Gope; David Mansell; Danny Loh; Ian Bratt

arXiv:2501.00032 · cs.LG · January 3, 2025

TL;DR

This paper introduces highly optimized kernels and a novel quantization method for LLM inference on Arm CPUs, significantly improving throughput and latency by reducing overhead and better matching weight distributions.

Contribution

The work presents new optimized kernels and a groupwise non-uniform quantization approach tailored for Arm CPUs, enhancing LLM inference efficiency and accuracy.
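As a rough illustration of what groupwise non-uniform quantization means, the sketch below splits a weight vector into fixed-size groups, stores one scale per group, and maps each scaled weight to the nearest entry of a small codebook whose levels are denser near zero. The codebook values, the group size, and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
# Hypothetical 8-entry (3-bit) codebook with non-uniform spacing.
# These levels are illustrative only and are NOT the paper's codebook.
CODEBOOK = [-1.0, -0.5, -0.25, -0.1, 0.1, 0.25, 0.5, 1.0]


def nearest_index(value, codebook):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def quantize_groupwise(weights, group_size, codebook=CODEBOOK):
    """Quantize a flat list of weights one group at a time.

    Each group keeps one float scale plus one codebook index per weight.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) or 1.0  # guard against all-zero groups
        indices = [nearest_index(w / scale, codebook) for w in group]
        groups.append((scale, indices))
    return groups


def dequantize_groupwise(groups, codebook=CODEBOOK):
    """Rebuild approximate weights from per-group scales and codebook indices."""
    restored = []
    for scale, indices in groups:
        restored.extend(scale * codebook[i] for i in indices)
    return restored


weights = [0.8, -0.4, 0.08, 0.2, 2.0, -1.0, 0.5, 0.2]
groups = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(groups)
```

Because the levels are not evenly spaced, a codebook like this can spend more of its few available codes where weights cluster, which is the intuition behind "better matching weight distributions".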

Findings

01. 3-3.2x faster prompt processing

02. 2x faster autoregressive decoding

03. Better throughput with ultra-low-precision quantization

Abstract

Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly…
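To make the dequantization overhead concrete, here is a minimal sketch of a generic symmetric 4-bit group format with a dot product that dequantizes on the fly. The format (two values packed per byte, a +8 offset, one float scale per group) and the function names are assumptions for illustration; this is not the paper's kernel, nor a byte-exact copy of any specific llama.cpp format.

```python
def quantize_group_q4(group):
    """Symmetric 4-bit quantization of one even-length group.

    Returns (scale, packed), where packed holds two 4-bit values per byte.
    """
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7.0 if max_abs else 1.0
    # Map each weight to an integer in [-7, 7], then add 8 to store it unsigned.
    q = [max(-7, min(7, round(w / scale))) + 8 for w in group]
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))
    return scale, packed


def dot_q4(scale, packed, activations):
    """Dot product of a packed 4-bit group with float activations."""
    acc = 0.0
    for byte_index, byte in enumerate(packed):
        # Overhead: mask, shift, and offset removal happen before any multiply.
        lo = (byte & 0x0F) - 8
        hi = (byte >> 4) - 8
        # The "real work": two multiply-accumulates per packed byte.
        acc += lo * activations[2 * byte_index]
        acc += hi * activations[2 * byte_index + 1]
    return acc * scale  # apply the per-group scale once at the end


scale, packed = quantize_group_q4([0.7, -0.7, 0.1, 0.0])
result = dot_q4(scale, packed, [1.0, 1.0, 1.0, 1.0])
```

Even in this toy version, each pair of multiplies is preceded by a mask, a shift, and two subtractions, which mirrors the abstract's point that many instructions in such kernels do not perform multiplies.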

Code & Models

Repositories

ggerganov/llama.cpp
pytorch · Official

Taxonomy

Topics: Advancements in Photolithography Techniques · Magnetic confinement fusion research · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · LLaMA