Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models

Chakshu Moar; Faraz Tahmasebi; Michael Pellauer; Hyoukjun Kwon

arXiv:2405.06626·cs.LG·October 24, 2024

Open Access

TL;DR

This paper investigates the trade-off between accuracy and efficiency when applying low-rank Tucker decomposition to large language models, reporting up to 9% model size reduction with minimal accuracy loss.

Contribution

The paper formalizes the low-rank decomposition design space for LLMs and presents case studies on Llama 2 and BERT models.

Findings

01

Achieves up to 9% model size reduction with minimal accuracy loss

02

Shows low-rank decomposition can be used without retraining for efficiency

03

Highlights potential for real-time LLM applications

Abstract

Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because of the dominance of matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored to achieve memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not well-understood yet. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank…
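As a rough illustration of the rank pruning the abstract describes, the sketch below factorizes a single 2-D weight matrix into two factor matrices and a small core, then compares parameter counts. It is a minimal sketch under stated assumptions, not the paper's procedure: for a 2-D matrix a Tucker-style factorization can be obtained from a truncated SVD, which is what is used here, and the function names (`tucker2_matrix`, `param_count`, `low_rank_matvec`), the layer shape, and the rank are all hypothetical.

```python
import numpy as np


def tucker2_matrix(weight, rank):
    """Factor a 2-D weight as U1 @ core @ U2.T using a truncated SVD.

    Assumption: for a matrix, keeping the top-`rank` singular vectors as the
    factor matrices and the singular values as a diagonal core gives a
    Tucker-style decomposition. Illustrative only.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    u1 = u[:, :rank]           # (m, rank) factor for the output dimension
    u2 = vt[:rank, :].T        # (n, rank) factor for the input dimension
    core = np.diag(s[:rank])   # (rank, rank) core
    return u1, core, u2


def param_count(m, n, rank):
    """Parameters stored after decomposition: two factors plus the core."""
    return m * rank + rank * rank + rank * n


def low_rank_matvec(u1, core, u2, x):
    """Apply the decomposed layer without rebuilding the dense weight."""
    return u1 @ (core @ (u2.T @ x))


rng = np.random.default_rng(0)
m, n, rank = 64, 48, 4  # hypothetical layer shape and pruned rank

# Build a weight that is exactly rank 4, so the truncation is lossless here.
weight = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

u1, core, u2 = tucker2_matrix(weight, rank)
approx = u1 @ core @ u2.T
rel_err = np.linalg.norm(weight - approx) / np.linalg.norm(weight)

dense_params = m * n                       # 3072
lowrank_params = param_count(m, n, rank)   # 464

x = rng.standard_normal(n)
y_dense = weight @ x
y_lowrank = low_rank_matvec(u1, core, u2, x)
```

Real LLM weights are not exactly low rank, so truncating the rank introduces approximation error. The stored parameter count falls from `m * n` to `m * rank + rank * rank + rank * n`, which is the kind of footprint reduction the accuracy-efficiency trade-off is weighed against.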

Taxonomy

Topics: Tensor decomposition and applications · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Residual Connection · Softmax