Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models
Chakshu Moar, Faraz Tahmasebi, Michael Pellauer, Hyoukjun Kwon

TL;DR
This paper investigates the trade-off between accuracy and efficiency when applying low-rank Tucker decomposition to large language models, reporting up to a 9% model size reduction with minimal accuracy loss.
Contribution
The paper formalizes the low-rank decomposition design space for LLMs and presents case studies on Llama 2 and BERT models.
Findings
Achieves up to 9% model size reduction with minimal accuracy loss
Shows that low-rank decomposition can improve efficiency without retraining
Highlights potential for real-time LLM applications
Abstract
Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because of the dominance of matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored to achieve memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not well-understood yet. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank…
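To make the idea of rank pruning concrete, here is a minimal NumPy sketch of a Tucker-style decomposition applied to a single 2-D weight matrix. It is an illustration only and an assumption on our part: the abstract excerpt above does not show the paper's exact formulation, and the helper names `tucker2_matrix` and `param_count`, the matrix shape, and the rank are all hypothetical. For a matrix, a truncated higher-order SVD reduces to keeping the top `rank` left and right singular vectors as factor matrices and projecting the weight onto them to obtain a small core.

```python
import numpy as np


def tucker2_matrix(weight, rank):
    """Truncated HOSVD of a 2-D weight: weight ~= u1 @ core @ u2.T."""
    # The leading left/right singular vectors serve as the two factor matrices.
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    u1 = u[:, :rank]            # shape (m, rank)
    u2 = vt[:rank, :].T         # shape (n, rank)
    # Project the weight onto both subspaces to obtain the small core.
    core = u1.T @ weight @ u2   # shape (rank, rank)
    return u1, core, u2


def param_count(m, n, rank):
    """Parameters stored by the two factor matrices plus the core."""
    return m * rank + rank * rank + rank * n


# Hypothetical layer: a 64 x 48 weight that is exactly rank 4 by construction.
rng = np.random.default_rng(0)
m, n, rank = 64, 48, 4
weight = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

u1, core, u2 = tucker2_matrix(weight, rank)
approx = u1 @ core @ u2.T

dense_params = m * n
decomposed_params = param_count(m, n, rank)
rel_error = np.linalg.norm(weight - approx) / np.linalg.norm(weight)
print(dense_params, decomposed_params, rel_error)
```

Because the toy weight is exactly rank 4, the reconstruction error here is near machine precision. Real model weights are generally not exactly low-rank, so pruning the rank trades some reconstruction error for fewer stored parameters, which is the trade-off the paper sets out to characterize.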
Taxonomy
Topics: Tensor decomposition and applications · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Residual Connection · Softmax
