QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

TL;DR
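QIGen is an automatic code generation method for quantized LLM inference on off-the-shelf CPUs. Generation is guided by the target architecture and a performance model that covers both hardware characteristics and method-specific accuracy constraints, and the resulting kernels reach high performance and high accuracy on LLaMA models.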
Contribution
It introduces a novel automatic code generation approach tailored for quantized LLM inference on CPUs, incorporating hardware and accuracy considerations.
Findings
Achieves high performance on CPU-based LLaMA inference
Maintains high accuracy with quantization constraints
Compares favorably to the best existing open-source solution
Abstract
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
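To make "quantized inference" concrete, here is a minimal Python sketch of the kind of computation such kernels perform: a dot product between a 4-bit weight group and a full-precision activation vector. It is an illustration under stated assumptions, not QIGen's generated code. The min-max quantization scheme, the nibble packing layout, and all function names (`quantize_group`, `pack_nibbles`, `unpack_nibbles`, `quantized_dot`) are hypothetical choices made for this example.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Min-max quantize one group of weights to unsigned `bits`-bit codes.

    Returns (codes, scale, minimum) such that weight ~= scale * code + minimum.
    Assumption: plain min-max rounding, chosen only for illustration.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first (hypothetical layout)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


def unpack_nibbles(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` 4-bit codes from the packed bytes."""
    out: List[int] = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


def quantized_dot(packed: bytes, scale: float, lo: float, x: List[float]) -> float:
    """Dot product of a packed 4-bit weight group with activations `x`.

    Uses sum((scale*c_i + lo) * x_i) = scale * sum(c_i * x_i) + lo * sum(x_i),
    so the scale and offset are applied once per group, not once per weight.
    """
    codes = unpack_nibbles(packed, len(x))
    return scale * sum(c * xi for c, xi in zip(codes, x)) + lo * sum(x)


if __name__ == "__main__":
    w = [0.31, -0.27, 0.05, 0.44, -0.5]
    x = [1.0, -2.0, 0.5, 3.0, 0.25]
    codes, scale, lo = quantize_group(w)
    approx = quantized_dot(pack_nibbles(codes), scale, lo, x)
    exact = sum(a * b for a, b in zip(w, x))
    print(f"exact={exact:.4f} quantized={approx:.4f}")
```

Packing halves the memory traffic relative to 8-bit storage, which matters because generative inference is typically memory-bound. A CPU kernel would do the unpacking and accumulation with vector instructions instead of Python loops.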
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: OPT
