ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition

Keran Zheng; Yinting Huang; Zhewen Yu; Christos-Savvas Bouganis

arXiv:2505.08981·cs.AR·May 15, 2025

TL;DR

The paper introduces ITERA-LLM, a co-designed framework combining sub-8-bit quantization and iterative tensor decomposition to efficiently compress large language models with minimal accuracy loss.

Contribution

It presents a novel software-hardware co-design approach that integrates low-rank tensor decomposition with quantization, improving compression and efficiency of LLM inference.
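As a rough illustration of why combining low-rank factors with quantization can reduce both storage and compute, consider a single linear layer. The symbols m, n, r, and b below are introduced here for illustration and are not taken from the paper.

```latex
% Dense layer: W \in \mathbb{R}^{m \times n}, stored at b bits per weight
\text{storage}_{\text{dense}} = m\,n\,b \ \text{bits}, \qquad
\text{MACs}_{\text{dense}} = m\,n \ \text{per input vector}

% Rank-r factorization: W \approx U V,\; U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{r \times n}
\text{storage}_{\text{low-rank}} = r\,(m+n)\,b \ \text{bits}, \qquad
\text{MACs}_{\text{low-rank}} = r\,(m+n) \ \text{per input vector}

% The factorized form is cheaper whenever
r < \frac{m\,n}{m+n}
```

For a square layer with m = n = 4096, for example, the factorized form saves work for any rank below 2048.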

Findings

01. Achieves up to 41.1% latency reduction in linear layers.

02. Maintains similar accuracy to quantization-only methods.

03. Provides a hardware-aware optimization process.

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This paper addresses this challenge through a software-hardware co-design framework, ITERA-LLM, which integrates sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation, leading to higher compression ratios and reduced computational complexity. The proposed approach is complemented by a hardware-aware Design Space Exploration (DSE) process that optimizes accuracy, latency, and resource…
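The abstract above is truncated and does not spell out the decomposition algorithm, so the following is only a minimal sketch of one plausible reading of "SVD-based iterative low-rank decomposition for error compensation". It assumes a greedy scheme: take the leading singular component of the current residual, quantize its two factors, subtract the quantized rank-1 term, and repeat, so that later components absorb quantization error left by earlier ones. The function names `quantize` and `iterative_lowrank_quant`, the symmetric per-vector quantizer, and the even split of the singular value between factors are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def quantize(x, bits):
    """Symmetric fixed-point quantization of one vector (assumed scheme).

    Returns the dequantized values, i.e. integers in [-qmax, qmax] times a scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def iterative_lowrank_quant(W, rank, bits):
    """Greedy rank-1-at-a-time decomposition with quantized factors.

    Each iteration decomposes the *residual*, so the quantization error of
    earlier components is visible to, and partly compensated by, later ones.
    """
    m, n = W.shape
    U = np.zeros((m, rank))
    V = np.zeros((rank, n))
    residual = W.astype(np.float64).copy()
    for k in range(rank):
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        root = np.sqrt(s[0])  # split the singular value evenly (assumption)
        U[:, k] = quantize(u[:, 0] * root, bits)
        V[k, :] = quantize(vt[0, :] * root, bits)
        residual -= np.outer(U[:, k], V[k, :])
    return U, V


# Small demo on a random matrix standing in for a linear-layer weight.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, V = iterative_lowrank_quant(W, rank=16, bits=4)
rel_err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
print(f"relative Frobenius error at rank 16, 4-bit factors: {rel_err:.3f}")
```

A random Gaussian matrix has a flat singular spectrum, so the error printed here says little about real LLM weights. The sketch only shows the control flow of residual-driven compensation; it recomputes a full SVD per iteration for clarity rather than efficiency.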

Taxonomy

Topics: Computational Physics and Python Applications · Topic Modeling · Tensor decomposition and applications