A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator

Sixiao Huang; Tintin Wang; Ang Li; Ao Shen; Kai Li; Keyao Jiang; Mingqiang Huang; Hao Yu

arXiv:2501.19135 · cs.AR · February 3, 2025

Open Access

TL;DR

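This paper applies tensor-train decomposition (TTD) to compress the linear layers of large language models and implements the compressed models on FPGA hardware using a group vector systolic array (GVSA). For ChatGLM3-6B and LLaMA2-7B, it reports whole-network compression ratios of 1.94× and 1.60× and first-token delay reductions of 1.45× and 1.57×, respectively.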

Contribution

It develops a novel tensor-train decomposition technique for LLM compression and demonstrates its efficient FPGA implementation with hardware-specific optimizations.

Findings

01

Achieved whole-network compression ratios of 1.94× and 1.60× for ChatGLM3-6B and LLaMA2-7B, respectively.

02

Reduced first-token delay on FPGA by 1.45× and 1.57× for ChatGLM3-6B and LLaMA2-7B, respectively.

03

Demonstrated efficient hardware implementation with a group vector systolic array architecture.

Abstract

Large language models (LLMs) are both storage-intensive and computation-intensive, posing significant challenges when deployed on resource-constrained hardware. Because linear layers are the main resource-consuming parts of LLMs, this paper develops a tensor-train decomposition (TTD) for LLMs, together with a hardware implementation on FPGA. TTD compression is applied to the linear layers of the ChatGLM3-6B and LLaMA2-7B models, giving whole-network compression ratios (CRs) of 1.94$\times$ and 1.60$\times$, respectively. The compressed LLMs are then implemented on FPGA hardware with a highly efficient group vector systolic array (GVSA) architecture, which has DSP-shared parallel vector PEs for TTD inference as well as optimized data communication in the accelerator. Experimental results show that the corresponding TTD-based LLM accelerator implemented on FPGA achieves 1.45$\times$ and…
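To make the idea of TTD compression of a linear layer concrete, here is a minimal NumPy sketch of the textbook TT-SVD procedure applied to a weight matrix in TT-matrix form. It is an illustration under stated assumptions, not the paper's method: the function names, the 64×64 matrix, the mode factorization (4, 4, 4) and the rank cap of 8 are all hypothetical, and the paper's actual factorizations, ranks and any fine-tuning are not shown on this page.

```python
import numpy as np


def tt_matrix_decompose(W, row_dims, col_dims, max_rank):
    """Textbook TT-SVD of a matrix W (M x N) into d four-way cores.

    M = prod(row_dims), N = prod(col_dims). Core k has shape
    (r_{k-1}, m_k, n_k, r_k) with r_0 = r_d = 1. All ranks are capped
    at max_rank (a hypothetical, uniform choice for this sketch).
    """
    d = len(row_dims)
    # Split rows and columns into modes, then interleave: (m1, n1, m2, n2, ...).
    T = W.reshape(*row_dims, *col_dims)
    perm = [axis for k in range(d) for axis in (k, d + k)]
    T = T.transpose(perm)
    modes = [m * n for m, n in zip(row_dims, col_dims)]

    cores = []
    r_prev = 1
    C = T.reshape(r_prev * modes[0], -1)
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(S))  # truncate to the rank cap
        cores.append(U[:, :r].reshape(r_prev, row_dims[k], col_dims[k], r))
        # Carry the remainder (singular values times right vectors) forward.
        C = (S[:r, None] * Vt[:r]).reshape(r * modes[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, row_dims[-1], col_dims[-1], 1))
    return cores


def tt_matrix_reconstruct(cores):
    """Contract the cores back into a dense (M x N) matrix."""
    d = len(cores)
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full.squeeze(axis=(0, -1))  # drop the boundary ranks r_0 = r_d = 1
    # Undo the interleaving: (m1, n1, ..., md, nd) -> (m1..md, n1..nd).
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    M = int(np.prod([c.shape[1] for c in cores]))
    N = int(np.prod([c.shape[2] for c in cores]))
    return full.transpose(perm).reshape(M, N)


def compression_ratio(W, cores):
    """Dense parameter count divided by the total TT-core parameter count."""
    return W.size / sum(c.size for c in cores)


# Toy usage with hypothetical sizes; real LLM layers are far larger.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
cores = tt_matrix_decompose(W, (4, 4, 4), (4, 4, 4), max_rank=8)
W_hat = tt_matrix_reconstruct(cores)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print("core shapes:", [c.shape for c in cores])
print("compression ratio:", compression_ratio(W, cores))
print("relative error:", rel_err)
```

A random matrix like the one above has no low-rank structure, so the truncation error is large; the sketch only shows the mechanics. The ratio computed here is for a single layer, whereas the 1.94$\times$ and 1.60$\times$ figures in the abstract are whole-network ratios.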

Taxonomy

Topics: Algorithms and Data Compression · Tensor decomposition and applications · Power Systems and Technologies