FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

Wenhao Wu; Zishan Shao; Kangning Cui; Jinhee Kim; Yixiao Wang; Hancheng Ye; Danyang Zhuo; Yiran Chen

arXiv:2605.08314·cs.LG·May 12, 2026

TL;DR

FlashSVD v1.5 is a unified inference runtime for SVD-compressed transformers that turns their parameter and FLOP savings into real serving speedups, reaching up to 2.55x on decode and 2.39x end to end.

Contribution

The paper presents a runtime co-design that reorganizes the low-rank transformer serving path. It maps diverse public SVD compression families to a common factorized representation, so a single runtime delivers practical speedups across them.

Findings

01. Achieves up to 2.55x decode speedup and 2.39x end-to-end speedup.

02. Attains 1.48x average decode speedup across multiple SVD families.

03. Demonstrates runtime co-design is essential for practical low-rank acceleration.

Abstract

SVD-based low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase-specific kernels with dense-KV decode, packed MLP execution, and per-layer CUDA-graph replay to reorganize the low-rank serving path into a thin runtime. Across representative decoder-serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end-to-end speedup, and it attains 1.48x average decode and 1.44x average end-to-end…
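
To make the "parameters and nominal FLOPs" claim concrete, here is a minimal sketch of the bookkeeping for a single linear layer, assuming the usual factorization of a d_out x d_in weight W into B (d_out x r) times A (r x d_in). All names here (`dense_cost`, `low_rank_cost`, `break_even_rank`, `apply_factorized`) are hypothetical illustrations and are not part of the FlashSVD API. The sketch also shows why one dense matrix product becomes two smaller ones at run time, which is the kind of path fragmentation the abstract attributes the speedup gap to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    params: int  # stored weights (bias ignored)
    flops: int   # nominal FLOPs per token, counting a multiply-add as 2


def dense_cost(d_in: int, d_out: int) -> LinearCost:
    """Cost of y = W x with a dense d_out x d_in weight."""
    return LinearCost(params=d_in * d_out, flops=2 * d_in * d_out)


def low_rank_cost(d_in: int, d_out: int, rank: int) -> LinearCost:
    """Cost of y = B (A x) with A: rank x d_in and B: d_out x rank."""
    params = rank * (d_in + d_out)
    return LinearCost(params=params, flops=2 * params)


def break_even_rank(d_in: int, d_out: int) -> float:
    """Rank below which the factorized layer has fewer nominal FLOPs."""
    return d_in * d_out / (d_in + d_out)


def apply_factorized(x, a, b):
    """Apply the factorized layer as two small products (assumed layout:
    a is a list of rank rows of length d_in, b is d_out rows of length rank)."""
    hidden = [sum(w * v for w, v in zip(row, x)) for row in a]  # A x
    return [sum(w * h for w, h in zip(row, hidden)) for row in b]  # B (A x)


if __name__ == "__main__":
    dense = dense_cost(4096, 4096)
    low = low_rank_cost(4096, 4096, rank=1024)
    print("dense params:", dense.params)
    print("rank-1024 params:", low.params)
    print("nominal FLOP ratio:", low.flops / dense.flops)
    print("break-even rank:", break_even_rank(4096, 4096))
```

For a square 4096 x 4096 layer, any rank below 2048 lowers the nominal FLOP count, yet the factorized path issues two kernel launches where the dense path issues one. This is a counting exercise only: it says nothing about measured latency, which is the quantity the paper reports.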

Code & Models

Repositories

Zishan-Shao/FlashSVD (GitHub)
