CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Gunho Park; Jeongin Bae; Byeongwook Kim; Baeseong Park; Jiwon Ryu; Hoseung Kim; Se Jung Kwon; Dongsoo Lee

arXiv:2512.17970 · cs.LG · December 23, 2025

TL;DR

CodeGEMM is a codebook-centric GEMM kernel that accelerates quantized LLM inference by replacing weight dequantization with lookups into precomputed partial sums, that is, inner products between codebook centroids and activations.

Contribution

The paper proposes a GEMM kernel that avoids dequantization overhead: instead of reconstructing weights from centroids, it uses code indices to gather precomputed centroid-activation inner products from a lightweight Psumbook.

Findings

01. Achieves a 1.83x speedup on Llama-3 8B in the 2-bit configuration.

02. Achieves an 8.93x speedup on Llama-3 70B in the 2-bit configuration.

03. Reduces latency and cache pressure compared to dequantization-based kernels.

Abstract

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration…
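The difference between the two paths described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' kernel: it assumes a single codebook, no scaling factors, and a single activation vector (GEMV rather than GEMM), and the names `gemv_dequant`, `build_psumbook`, and `gemv_psumbook` are hypothetical. It only shows that gathering precomputed partial sums by code index gives the same result as reconstructing the weights first.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gemv_dequant(codes, codebook, x, v):
    """Baseline path: fetch the centroid for every code, then multiply."""
    y = []
    for row in codes:                      # one row of codes per output element
        acc = 0.0
        for j, c in enumerate(row):        # j indexes a chunk of v input elements
            acc += dot(codebook[c], x[j * v:(j + 1) * v])
        y.append(acc)
    return y

def build_psumbook(codebook, x, v):
    """Precompute centroid-activation inner products for every input chunk.

    psumbook[j][c] = <codebook[c], x[j*v:(j+1)*v]>  (hypothetical layout)
    """
    n_chunks = len(x) // v
    return [[dot(centroid, x[j * v:(j + 1) * v]) for centroid in codebook]
            for j in range(n_chunks)]

def gemv_psumbook(codes, psumbook):
    """Psumbook path: code indices gather partial sums; no weights are rebuilt."""
    return [sum(psumbook[j][c] for j, c in enumerate(row)) for row in codes]

# Toy sizes chosen for illustration only (assumptions, not the paper's settings).
random.seed(0)
v, n_centroids, in_dim, out_dim = 4, 8, 16, 6
codebook = [[random.uniform(-1, 1) for _ in range(v)] for _ in range(n_centroids)]
codes = [[random.randrange(n_centroids) for _ in range(in_dim // v)]
         for _ in range(out_dim)]
x = [random.uniform(-1, 1) for _ in range(in_dim)]

y_ref = gemv_dequant(codes, codebook, x, v)
y_psum = gemv_psumbook(codes, build_psumbook(codebook, x, v))
print(max(abs(a - b) for a, b in zip(y_ref, y_psum)))
```

In this sketch the Psumbook is built once per activation vector and shared by every output row, so the per-row work reduces to one table lookup and one addition per chunk instead of a length-`v` inner product.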

Code & Models

Models

🤗 gunho1123/Llama-3.1-8B-Instruct-Codegemm-m2v8g128
model · 8 downloads

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Compression Techniques