QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

Tommaso Pegolotti; Elias Frantar; Dan Alistarh; Markus Püschel

arXiv:2307.03738·cs.LG·July 10, 2023

TL;DR

QIGen is an automatic code generation method for quantized LLM inference on off-the-shelf CPUs. It generates kernels guided by the target architecture and a performance model that covers both hardware characteristics and accuracy constraints, reaching high performance and high accuracy on LLaMA models.

Contribution

It introduces a novel automatic code generation approach tailored for quantized LLM inference on CPUs, incorporating hardware and accuracy considerations.

Findings

01 Achieves high performance on CPU-based LLaMA inference

02 Maintains high accuracy with quantization constraints

03 Outperforms existing open-source solutions

Abstract

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
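To make "quantized inference" concrete, the following is a minimal sketch of the kind of computation such kernels perform: weights are stored as low-bit integer codes with a scale and offset, and are dequantized on the fly inside a dot product. It assumes plain uniform min-max quantization to 4 bits over a single weight group; the function names (`quantize_group`, `pack_nibbles`, `quantized_dot`) are hypothetical and this is not the code QIGen generates, which the abstract does not describe in detail.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Uniform min-max quantization of one weight group (illustrative assumption)."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4 bits
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    # Map each weight to the nearest integer code and clamp to the valid range.
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def pack_nibbles(codes: List[int]) -> bytes:
    """Pack pairs of 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def quantized_dot(packed: bytes, scale: float, zero: float, x: List[float]) -> float:
    """Dot product that unpacks and dequantizes each weight as it is used."""
    acc = 0.0
    for i, xi in enumerate(x):
        byte = packed[i // 2]
        code = (byte >> 4) if i % 2 else (byte & 0x0F)
        acc += (code * scale + zero) * xi
    return acc


weights = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25]
x = [1.0, 2.0, -1.0, 0.5, 1.5, -2.0]
codes, scale, zero = quantize_group(weights)
approx = quantized_dot(pack_nibbles(codes), scale, zero, x)
exact = sum(w * xi for w, xi in zip(weights, x))
```

Each dequantized weight differs from the original by at most half a quantization step, so `approx` stays within `sum(abs(xi) for xi in x) * scale / 2` of `exact`. A real CPU kernel would vectorize the unpacking and accumulate over many groups; this sketch only shows the arithmetic.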

Code & Models

Repositories

ist-daslab/qigen (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: OPT