FullPack: Full Vector Utilization for Sub-Byte Quantized Inference on General Purpose CPUs

Hossein Katebi; Navidreza Asadi; Maziar Goudarzi

arXiv:2211.06982 · cs.PF · November 22, 2022


Open Access

TL;DR

This paper introduces a memory-efficient layout and processing methods for sub-byte quantized neural network inference on CPUs, achieving significant speedups over existing techniques and improving performance in real-world applications.

Contribution

It proposes novel memory layouts and compute kernels that fully utilize bits in memory and registers for sub-byte quantization, enhancing inference speed on general-purpose CPUs.

Findings

01. Achieves up to 6.7x speedup for large models.

02. Demonstrates 1.56-2.11x end-to-end speedup on DeepSpeech.

03. Outperforms nine existing methods in detailed evaluations.

Abstract

Although prior art has demonstrated a negligible accuracy drop with sub-byte quantization, where weights and/or activations are represented by fewer than 8 bits, popular CPU SIMD instructions do not natively support these data types. While recent methods such as ULPPACK already use sub-byte quantization on general-purpose CPUs with vector units, they leave several empty bits between the sub-byte values in memory and in vector registers to prevent overflow into neighbouring values during operations. This results in memory-footprint and bandwidth inefficiencies and suboptimal performance. In this paper, we present memory layouts for storing, and mechanisms for processing, sub-byte (4-, 2-, or 1-bit) models that utilize all the bits in memory as well as in the vector registers for the actual data. We provide compute kernels for the proposed layout for the GEMV (GEneral…
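To make the idea of using every bit for data concrete, the following scalar sketch packs signed 4-bit weights two per byte with no padding bits and then computes a matrix-vector product from the packed rows. It is an illustration under stated assumptions, not the FullPack layout or its vectorized kernels: the nibble order (low nibble first) and the function names `pack_int4`, `unpack_int4`, and `gemv_int4` are hypothetical choices made for this example.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    Assumed layout for illustration only; no padding bits are left
    between neighbouring values.
    """
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count with a zero weight
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in 4 signed bits")
        # Two's-complement nibbles: mask each value to 4 bits, then combine.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes made by pack_int4."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


def gemv_int4(packed_rows, n_cols, x):
    """Matrix-vector product where each matrix row is stored packed."""
    result = []
    for row in packed_rows:
        weights = unpack_int4(row, n_cols)
        result.append(sum(w * a for w, a in zip(weights, x)))
    return result


if __name__ == "__main__":
    W = [[1, -2, 3], [7, -8, 0]]
    x = [1, 2, 3]
    packed = [pack_int4(row) for row in W]
    print([len(p) for p in packed])   # bytes used per row
    print(gemv_int4(packed, 3, x))    # matches the unpacked product
```

With this dense packing, each row of three weights occupies two bytes instead of the three bytes that one value per byte would need. A real kernel would operate on whole vector registers rather than unpacking element by element as this sketch does.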


Taxonomy

Topics: Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques · Advanced Image and Video Retrieval Techniques