Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

Jungi Lee; Wonbeom Lee; Jaewoong Sim

arXiv:2406.12930·cs.LG·June 21, 2024

Open Access

TL;DR

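Tender is an algorithm-hardware co-design that enables LLM inference at low precision. It decomposes tensors so that the scale factors of the resulting matrices are powers of two apart, which avoids explicit requantization when partial sums are accumulated, and it reports higher accuracy and inference performance than state-of-the-art methods.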

Contribution

It introduces a decomposed quantization technique in which the scale factors of the decomposed matrices are powers of two apart, so partial sums can be accumulated at low precision without explicit requantization (dequantization followed by quantization).

Findings

01. Higher inference accuracy compared to state-of-the-art methods.

02. Significantly improved inference performance.

03. Minimal hardware modifications needed.

Abstract

Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the…
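The accumulation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the grouping rule (channels bucketed by their maximum magnitude), the function names, the group count, and the 8-bit width are all assumptions chosen for illustration. The sketch shows that when group scales are powers of two apart, shifting the integer accumulator by one bit per group gives the same result as dequantizing each partial sum separately.

```python
import numpy as np

def split_into_groups(x, num_groups=4, bits=8):
    """Quantize columns of x into groups whose scales are powers of two apart.

    Hypothetical grouping rule (an assumption, not taken from the paper):
    a channel lands in group k when its max magnitude lies in
    (t_max / 2^(k+1), t_max / 2^k]; the last group takes everything smaller.
    """
    qmax = 2 ** (bits - 1) - 1
    ch_max = np.maximum(np.abs(x).max(axis=0), 1e-30)  # guard all-zero channels
    t_max = ch_max.max()
    group = np.clip(np.floor(np.log2(t_max / ch_max)), 0, num_groups - 1).astype(int)
    groups = []
    for k in range(num_groups):
        idx = np.flatnonzero(group == k)
        scale = t_max / (2 ** k) / qmax  # each scale is half the previous one
        # Values fit in `bits` bits; int64 storage just keeps the matmul simple.
        xq = np.clip(np.rint(x[:, idx] / scale), -qmax, qmax).astype(np.int64)
        groups.append((idx, xq, scale))
    return groups

def quantize_weight(w, bits=8):
    """Symmetric per-tensor weight quantization (an illustrative choice)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-30) / qmax
    return np.rint(w / scale).astype(np.int64), scale

def matmul_shift_accumulate(groups, wq, w_scale):
    """Accumulate integer partial sums, shifting by one bit between groups."""
    acc = np.zeros((groups[0][1].shape[0], wq.shape[1]), dtype=np.int64)
    for idx, xq, _ in groups:
        # Doubling the running sum re-expresses it in the next (halved) scale,
        # so no dequantize/quantize step is needed between groups.
        acc = (acc << 1) + xq @ wq[idx, :]
    return acc * (groups[-1][2] * w_scale)  # one dequantization at the end

def matmul_explicit_dequant(groups, wq, w_scale):
    """Reference: dequantize every partial sum with its own scale."""
    out = 0.0
    for idx, xq, scale in groups:
        out = out + (xq @ wq[idx, :]) * (scale * w_scale)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
x[:, :4] *= 50.0  # a few synthetic outlier channels
w = rng.standard_normal((64, 32))

groups = split_into_groups(x)
wq, w_scale = quantize_weight(w)
y_shift = matmul_shift_accumulate(groups, wq, w_scale)
y_ref = matmul_explicit_dequant(groups, wq, w_scale)
rel_err = np.linalg.norm(y_shift - x @ w) / np.linalg.norm(x @ w)
```

Here `y_shift` and `y_ref` agree up to floating-point rounding, while `rel_err` measures the quantization error against the full-precision product on this synthetic input.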

Taxonomy

Topics: Topic Modeling · Computational Physics and Python Applications · Tensor decomposition and applications