Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
Jungi Lee, Wonbeom Lee, Jaewoong Sim

TL;DR
Tender is an algorithm-hardware co-design that accelerates large language model inference at low precision by combining tensor decomposition with runtime requantization, improving both accuracy and inference performance over state-of-the-art methods.
Contribution
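It introduces a decomposed quantization technique in which the scale factors of the decomposed matrices are powers of two apart, so partial sums from those matrices can be accumulated at low precision without explicit requantization (i.e., dequantization followed by quantization).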
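The idea of scale factors that are powers of two apart can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `decomposed_matmul`, the `num_groups` and `bits` parameters, the rule for assigning channels to groups, and the per-tensor weight quantization are all assumptions made for illustration. The sketch splits the activation channels into groups whose scales differ by a factor of two, then accumulates the integer partial sums from the largest scale downward, shifting the accumulator left by one bit between groups instead of dequantizing and requantizing.

```python
import numpy as np

def decomposed_matmul(x, w, num_groups=4, bits=8):
    """Hypothetical sketch: integer matmul over channel groups with 2x-apart scales."""
    qmax = 2 ** (bits - 1) - 1

    # Per-channel absolute maximum of the activation matrix (columns of x).
    ch_max = np.abs(x).max(axis=0)

    # Group g uses scale base_scale * 2**g; the top group covers the largest channel.
    # (Assumes x is not all zeros.)
    base_scale = ch_max.max() / qmax / 2 ** (num_groups - 1)

    # Assign each channel to the smallest group whose range can hold it.
    ratio = np.maximum(ch_max / (qmax * base_scale), 1e-12)
    group = np.clip(np.ceil(np.log2(ratio)), 0, num_groups - 1).astype(int)

    # Symmetric per-tensor weight quantization (an assumption of this sketch).
    w_scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / w_scale), -qmax, qmax).astype(np.int64)

    acc = np.zeros((x.shape[0], w.shape[1]), dtype=np.int64)
    for g in range(num_groups - 1, -1, -1):
        # Moving to a scale that is 2x smaller: a 1-bit left shift re-expresses
        # the running sum in the new scale, with no dequantize/quantize step.
        acc = acc << 1
        cols = np.where(group == g)[0]
        if cols.size:
            scale_g = base_scale * 2 ** g
            x_q = np.clip(np.round(x[:, cols] / scale_g), -qmax, qmax).astype(np.int64)
            acc += x_q @ w_q[cols, :]

    # A single floating-point rescale at the very end.
    return acc.astype(np.float64) * base_scale * w_scale

# Usage: activations with a few outlier channels, compared against the float result.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
x[:, :4] *= 50  # synthetic outlier channels
w = rng.standard_normal((64, 32))
y = decomposed_matmul(x, w)
rel_err = np.linalg.norm(y - x @ w) / np.linalg.norm(x @ w)
print(f"relative error: {rel_err:.4f}")
```

Because the group scales differ only by powers of two, the accumulator ends up holding the sum of each group's partial product weighted by 2^g, which is why one multiplication by the base scale at the end recovers the result.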
Findings
Higher inference accuracy compared to state-of-the-art methods.
Significantly improved inference performance.
Minimal hardware modifications needed.
Abstract
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the…
Taxonomy
Topics: Topic Modeling · Computational Physics and Python Applications · Tensor decomposition and applications
