ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis

TL;DR
The paper introduces ITERA-LLM, a co-designed framework combining sub-8-bit quantization and iterative tensor decomposition to efficiently compress large language models with minimal accuracy loss.
Contribution
The paper presents a software-hardware co-design approach that integrates low-rank tensor decomposition with quantization, improving both the compression ratio and the efficiency of LLM inference.
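To make the efficiency argument concrete, the following is a minimal sketch of the standard cost model for a factorized linear layer. It is not taken from the paper: the function names and the 4096 x 4096 layer size are illustrative assumptions. A dense weight matrix of shape (m, n) needs m*n multiply-accumulates (MACs) per input vector, while a rank-r factorization W ≈ U V needs r*(m + n).

```python
def linear_layer_macs(m, n, rank=None):
    """MACs per input vector for y = W x, with W of shape (m, n).

    If `rank` is given, W is assumed to be replaced by U @ V with
    U of shape (m, rank) and V of shape (rank, n).
    """
    if rank is None:
        return m * n
    return rank * (m + n)


def break_even_rank(m, n):
    """Largest rank for which the factorized layer uses strictly fewer MACs."""
    return (m * n - 1) // (m + n)


# Hypothetical layer size, chosen only for illustration.
dense = linear_layer_macs(4096, 4096)          # 16_777_216
lowrank = linear_layer_macs(4096, 4096, 512)   # 4_194_304
print(dense / lowrank)                         # 4.0
print(break_even_rank(4096, 4096))             # 2047
```

The same counts apply to stored parameters, so under this model the rank bounds both the compression ratio and the arithmetic cost of a layer.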
Findings
Achieves up to 41.1% latency reduction in linear layers.
Maintains similar accuracy to quantization-only methods.
Provides a hardware-aware optimization process.
Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This paper addresses this challenge through a software-hardware co-design framework, ITERA-LLM, which integrates sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation, leading to higher compression ratios and reduced computational complexity. The proposed approach is complemented by a hardware-aware Design Space Exploration (DSE) process that optimizes accuracy, latency, and resource…
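The general idea of combining quantization with iterative SVD-based decomposition can be sketched as follows. This is an assumption-laden illustration, not the authors' algorithm: the abstract does not specify the quantizer, the update rule, or the function names used here (`quantize`, `iterative_quantized_lowrank`), all of which are hypothetical. The sketch extracts one rank-1 component at a time, quantizes its factors, and subtracts the quantized component from the residual, so that each later component can compensate for the quantization error of the earlier ones.

```python
import numpy as np


def quantize(x, bits):
    """Symmetric per-tensor uniform quantization (returns de-quantized floats)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def iterative_quantized_lowrank(w, rank, bits):
    """Approximate w by U @ V, where every column of U and row of V is quantized.

    Assumed scheme: at each step, take the leading singular pair of the
    current residual, quantize both factors, and remove the *quantized*
    rank-1 term from the residual before the next step.
    """
    residual = w.astype(np.float64)
    us, vs = [], []
    for _ in range(rank):
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        root = np.sqrt(s[0])                 # split the singular value evenly
        u1 = quantize(u[:, 0] * root, bits)
        v1 = quantize(vt[0] * root, bits)
        us.append(u1)
        vs.append(v1)
        residual = residual - np.outer(u1, v1)
    return np.stack(us, axis=1), np.stack(vs, axis=0)


# Usage on a random matrix (illustrative data only).
rng = np.random.default_rng(0)
w_demo = rng.standard_normal((32, 16))
u_demo, v_demo = iterative_quantized_lowrank(w_demo, rank=8, bits=4)
rel_err = np.linalg.norm(w_demo - u_demo @ v_demo) / np.linalg.norm(w_demo)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Recomputing a full SVD at every step is wasteful for large layers; it is used here only to keep the sketch short.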
Taxonomy
Topics: Computational Physics and Python Applications · Topic Modeling · Tensor decomposition and applications
