APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

TL;DR
This paper introduces APT-LLM, a GPU acceleration scheme for arbitrary-precision large language models. It speeds up inference by combining a new data format (bipolar-INT), a bit-level matrix multiplication method, and improved memory management.
Contribution
The paper presents an acceleration framework for arbitrary-precision LLMs that combines the bipolar-INT data format, bit-level matrix multiplication, and dynamic kernel optimization.
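To make the idea of bit-level matrix multiplication concrete, here is a minimal pure-Python sketch of the general technique: each operand is split into one-bit planes, the planes are multiplied pairwise, and each partial product is weighted by the matching power of two. It is an illustration under stated assumptions, not the authors' kernel. All function names are hypothetical, and `bipolar_digit` assumes that a bipolar-INT bit encodes -1 or +1 rather than 0 or 1, which this summary does not spell out.

```python
from typing import Callable, List

Matrix = List[List[int]]
Digit = Callable[[int, int], int]  # (stored code, bit index) -> digit value


def unsigned_digit(code: int, k: int) -> int:
    """Ordinary binary: bit k is worth 0 or 1."""
    return (code >> k) & 1


def bipolar_digit(code: int, k: int) -> int:
    """ASSUMPTION (not stated in this summary): bit k is worth -1 or +1."""
    return 1 if (code >> k) & 1 else -1


def decode(code: int, bits: int, digit: Digit) -> int:
    """Integer value represented by `code` under the given digit mapping."""
    return sum(digit(code, k) * (1 << k) for k in range(bits))


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product; stands in for a 1-bit hardware MatMul."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bitlevel_matmul(a: Matrix, b: Matrix, a_bits: int, b_bits: int,
                    digit: Digit) -> Matrix:
    """Rebuild the full-precision product from 1-bit plane products."""
    out = [[0] * len(b[0]) for _ in a]
    for i in range(a_bits):
        plane_a = [[digit(v, i) for v in row] for row in a]
        for j in range(b_bits):
            plane_b = [[digit(v, j) for v in row] for row in b]
            partial = matmul(plane_a, plane_b)
            # Bit i of A times bit j of B carries weight 2**(i + j).
            for r, row in enumerate(partial):
                for c, v in enumerate(row):
                    out[r][c] += v * (1 << (i + j))
    return out


# Usage: the plane-wise result matches a direct product of the decoded values.
a = [[3, 1], [0, 2]]  # 2-bit codes
b = [[1, 2], [3, 0]]
for digit in (unsigned_digit, bipolar_digit):
    ref = matmul([[decode(v, 2, digit) for v in row] for row in a],
                 [[decode(v, 2, digit) for v in row] for row in b])
    assert bitlevel_matmul(a, b, 2, 2, digit) == ref
```

Under the assumed ±1 mapping, the four 2-bit codes decode to -3, -1, 1, and 3, a range that is symmetric around zero. Because the bit widths are loop bounds rather than fixed types, the same routine handles any combination of operand precisions.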
Findings
Achieves up to 3.99× speedup over FP16 on RTX 3090.
Attains 2.16× speedup over NVIDIA CUTLASS INT4 on RTX 3090.
Provides up to 2.44× speedup over FP16 on RTX 4090 and H800.
Abstract
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM. First, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling…
