FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, and Hai Li

arXiv:2508.01506·cs.LG·August 5, 2025

Open Access

TL;DR

FlashSVD is a streaming inference framework that reduces peak activation memory in SVD-compressed large language models by up to 70.2% with no accuracy loss, making on-device deployment more practical.

Contribution

The paper presents an end-to-end, rank-aware streaming approach that fuses low-rank kernels into the model pipeline, so full-size activations are never materialized and activation memory overhead is reduced.
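To make the idea of avoiding full activation materialization concrete, here is a minimal sketch in plain Python. It is not the FlashSVD kernel: it assumes a feed-forward block whose two weight matrices are each stored as low-rank factors (the hypothetical names `u1`, `v1`, `u2`, `v2`), uses ReLU as a stand-in activation, and tiles along the sequence dimension, which is only one possible tiling axis. The sizes and the `tile_rows` parameter are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(m):
    """Element-wise ReLU (a stand-in activation, chosen for simplicity)."""
    return [[max(0.0, v) for v in row] for row in m]


def ffn_full(x, u1, v1, u2, v2):
    """Baseline: the wide (seq_len x d_ff) intermediate exists all at once."""
    h = relu(matmul(matmul(x, u1), v1))
    return matmul(matmul(h, u2), v2)


def ffn_streamed(x, u1, v1, u2, v2, tile_rows):
    """Process tile_rows rows at a time, so the wide intermediate
    only ever has shape (tile_rows x d_ff)."""
    out = []
    for start in range(0, len(x), tile_rows):
        tile = x[start:start + tile_rows]
        h = relu(matmul(matmul(tile, u1), v1))
        out.extend(matmul(matmul(h, u2), v2))
    return out


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1], for demonstration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes (assumptions, not taken from the paper).
rng = random.Random(0)
seq_len, d_model, d_ff, rank = 8, 4, 16, 2
x = rand_matrix(seq_len, d_model, rng)
u1, v1 = rand_matrix(d_model, rank, rng), rand_matrix(rank, d_ff, rng)
u2, v2 = rand_matrix(d_ff, rank, rng), rand_matrix(rank, d_model, rng)

full = ffn_full(x, u1, v1, u2, v2)
streamed = ffn_streamed(x, u1, v1, u2, v2, tile_rows=3)
max_diff = max(
    abs(a - b) for row_a, row_b in zip(full, streamed) for a, b in zip(row_a, row_b)
)
print("max abs difference between full and streamed outputs:", max_diff)
```

Because each output row depends only on its own input row, the tiled version computes the same result while holding at most `tile_rows * d_ff` intermediate values instead of `seq_len * d_ff`. A GPU implementation would do this inside fused kernels rather than a Python loop.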

Findings

01. Peak activation memory reduced by up to 70.2%
02. Intermediate memory reduced by 75%
03. No accuracy loss during inference

Abstract

Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, which scales with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments. We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for…
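The gap the abstract describes, where weights shrink but activations do not, can be checked with simple counting. The sketch below assumes a single linear layer of shape `d_in x d_out` replaced by two rank-`rank` factors applied as two separate dense matrix multiplications. The function names and the sizes are hypothetical and chosen for illustration; they are not figures from the paper.

```python
def dense_params(d_in, d_out):
    """Parameter count of an uncompressed d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Parameter count of two factors: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


def dense_activation_elems(seq_len, d_out):
    """Output elements produced by the dense layer."""
    return seq_len * d_out


def low_rank_activation_elems(seq_len, d_out, rank):
    """Elements produced when the factors are applied as two separate
    dense matmuls: a (seq_len x rank) intermediate plus the
    (seq_len x d_out) output."""
    return seq_len * rank + seq_len * d_out


# Illustrative sizes (assumptions, not taken from the paper).
d_in, d_out, rank, seq_len = 768, 3072, 256, 512

print("dense weights:        ", dense_params(d_in, d_out))
print("low-rank weights:     ", low_rank_params(d_in, d_out, rank))
print("dense activations:    ", dense_activation_elems(seq_len, d_out))
print("low-rank activations: ", low_rank_activation_elems(seq_len, d_out, rank))
```

Under these assumed sizes the factorized layer stores fewer than half the weights, yet it produces slightly more activation elements than the dense layer, because the output is unchanged and the rank-sized intermediate is extra. Both activation counts grow linearly with `seq_len`.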

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Topic Modeling