SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Aurick Qiao; Zhewei Yao; Samyam Rajbhandari; Yuxiong He

arXiv:2410.03960 · cs.LG · June 3, 2025

TL;DR

SwiftKV is a model transformation technique that reduces prefill computation in large language models, improving inference speed while maintaining output quality. It targets enterprise workloads in which prompts are much longer than generations.

Contribution

It combines prefilling later layers' KV cache from an earlier layer's output with a lightweight knowledge-preserving distillation procedure, achieving a 25-50% reduction in prefill FLOPs with minimal accuracy loss.
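
As a rough illustration of where a 25-50% figure can come from, the sketch below assumes every transformer layer costs about the same during prefill and that prompt tokens run only the first `last_full_layer` layers. It ignores the cost of the key/value projections still needed for the skipped layers, so it is an upper-bound estimate, not the paper's own accounting. The function name and the specific layer counts are hypothetical.

```python
def prefill_flops_reduction(num_layers: int, last_full_layer: int) -> float:
    """Approximate fraction of prefill FLOPs saved when prompt tokens
    run only the first `last_full_layer` of `num_layers` layers.

    Assumption: all layers cost the same, and the remaining KV
    projections for the skipped layers are treated as free.
    """
    if not 0 < last_full_layer <= num_layers:
        raise ValueError("last_full_layer must be in (0, num_layers]")
    return 1.0 - last_full_layer / num_layers


# Hypothetical 80-layer model: stopping prompt tokens after layer 40
# or layer 60 brackets the 25-50% range quoted above.
for cut in (40, 60):
    print(cut, prefill_flops_reduction(80, cut))
```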

Findings

01. Reduces prefill computation by 25-50% across LLMs

02. Doubles inference throughput and reduces token generation time by 60%

03. Achieves 560 TFlops/GPU inference throughput for Llama-3.1-70B
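
To put the 560 TFlops/GPU figure in more familiar units, the sketch below converts a FLOP rate into tokens per second using the common approximation that a dense forward pass costs about 2 FLOPs per parameter per token. That approximation, the 70e9 parameter count, and the function name are assumptions for illustration; this summary does not state how the paper normalizes its throughput number.

```python
def tokens_per_second(flops_per_second: float, num_params: float) -> float:
    """Convert a FLOP rate into an approximate token rate.

    Assumption: one token costs about 2 * num_params FLOPs in a dense
    forward pass (attention-score FLOPs are ignored).
    """
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    return flops_per_second / (2.0 * num_params)


# 560 TFlops on one GPU with a 70B-parameter model (assumed size).
per_gpu = tokens_per_second(560e12, 70e9)
print(per_gpu)  # 4000.0 tokens/s per GPU under these assumptions
```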

Abstract

LLM inference for enterprise applications, such as summarization, RAG, and code generation, typically observes much longer prompts than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can effectively reduce prefill computation by 25-50% across several LLM…
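
The first step in the abstract (filling later layers' KV cache from an earlier layer's output) can be sketched with a toy NumPy model. This is a minimal sketch under stated assumptions, not the authors' implementation: the single-head layer without normalization or MLP, and the names `make_layer`, `full_layer`, `swiftkv_style_prefill`, and `num_full_layers` are all hypothetical. It only builds the cache and leaves out distillation and the decoding path.

```python
import numpy as np


def causal_attention(q, k, v):
    """Single-head causal self-attention over a (tokens, dim) sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def make_layer(rng, dim):
    """Toy layer: random query/key/value/output projections only."""
    return {name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for name in ("wq", "wk", "wv", "wo")}


def full_layer(layer, x):
    """Run one toy layer; return its output and its (K, V) cache entry."""
    k, v = x @ layer["wk"], x @ layer["wv"]
    attn = causal_attention(x @ layer["wq"], k, v)
    return x + attn @ layer["wo"], (k, v)


def swiftkv_style_prefill(layers, x, num_full_layers):
    """Build a KV cache for every layer while running only the first
    `num_full_layers` layers on the prompt tokens."""
    kv_cache = []
    for layer in layers[:num_full_layers]:
        x, kv = full_layer(layer, x)
        kv_cache.append(kv)
    # Later layers: project the last computed hidden state with each
    # layer's own key/value weights; attention and the rest are skipped.
    for layer in layers[num_full_layers:]:
        kv_cache.append((x @ layer["wk"], x @ layer["wv"]))
    return kv_cache


rng = np.random.default_rng(0)
layers = [make_layer(rng, 8) for _ in range(4)]
prompt = rng.standard_normal((5, 8))  # 5 prompt tokens, hidden size 8
cache = swiftkv_style_prefill(layers, prompt, num_full_layers=2)
print(len(cache), cache[3][0].shape)
```

In this toy setup the skipped layers still receive a cache entry of the usual shape, so decoding can attend over the prompt in every layer even though the prompt tokens never ran those layers.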

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

S1. Inference optimization for Transformer-based LLMs is an important topic that has been extensively studied in recent years.

S2. Several key components are proposed in this paper, with their usefulness showcased in the evaluation.

S3. The proposed method is orthogonal to many existing optimizations, and they can be used jointly to further optimize performance.

Weaknesses

W1. SingleInputKV borrows observations and ideas from previous works, as stated in the submission (the same observation has also been utilized in the InfiniGen paper published at OSDI 2024).

W2. A core technique in the proposed method is cross-layer KV cache compression. A comparison or discussion with state-of-the-art KV cache compression, merging, and cross-layer works is missing, e.g., PyramidKV and infini-attention. It is encouraged to discuss the difference and the novelty compared to existing KV cac…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

S1. The paper proposes two techniques derived from insights from prior research, demonstrating their efficacy in reducing computational and memory costs during LLM inference.

S2. The authors show that fine-tuning can alleviate the decline in benchmark scores, emphasizing the practicality of the proposed methods without notably sacrificing model performance.

Weaknesses

W1. The experiments in the paper are somewhat limited.

- The authors evaluate the proposed techniques only on Llama-3.1 models. Testing a wider variety of models would strengthen the results. If the proposed methods could demonstrate their benefits across transformer models with different attention mechanisms (e.g., sparse attention, low-rank attention), scaling approaches (e.g., wide scaling, deep scaling, sparse scaling), and sizes (Llama-3.2-1B, Llama-3.2-3B, Llama-3.2-8B, Llama-3.2-11B, Ll…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The ideas for the various optimizations are presented reasonably clearly, and they seem novel as well, especially in combination. The evaluation on a number of models and datasets/benchmarks supports the performance claims, and a reasonable ablation study is provided as well.

Weaknesses

For me, the biggest issue is that end-to-end results are missing, which makes it hard to put the presented inference results (throughput, latency) into context and makes me question how useful the presented numbers are.

* Apart from SingleInputKV, the other optimizations are not properly motivated with reasoning for why they should work (e.g., some form of microbenchmark).

* End-to-end results are missing, especially since some of the writing, if I am not mistaken, suggests t…

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · WordPiece · Attention Dropout · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Dropout · Byte Pair Encoding · BERT