TL;DR
SwiftKV is a novel model transformation technique that reduces prefill computation in large language models, significantly improving inference speed while maintaining high output quality, suitable for enterprise applications.
Contribution
It introduces a cache reuse method and knowledge-preserving distillation to optimize LLM inference, achieving 25-50% prefill FLOPs reduction with minimal accuracy loss.
Findings
Reduces prefill computation by 25-50% across LLMs
Doubles inference throughput and reduces token generation time by 60%
Achieves 560 TFlops/GPU inference throughput for Llama-3.1-70B
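A rough back-of-the-envelope check of the 25-50% figure, not the authors' accounting. The sketch assumes that prompt tokens run the first `cut` layers in full and need only the K and V projections in the remaining layers. It counts only linear-layer weights and ignores attention-score FLOPs, norms, embeddings, and the LM head. The function names and the Llama-3.1-8B-like dimensions (32 layers, hidden size 4096, KV width 1024, MLP width 14336) are illustrative assumptions.

```python
def layer_linear_params(d_model, d_kv, d_ff):
    """Weights touched per token in one full layer (assumed Llama-style block)."""
    attn = 2 * d_model * d_model + 2 * d_model * d_kv  # Q and O, plus K and V
    mlp = 3 * d_model * d_ff                           # gate, up, down projections
    return attn + mlp


def kv_params(d_model, d_kv):
    """Weights touched per token when only K and V are projected."""
    return 2 * d_model * d_kv


def prefill_reduction(num_layers, cut, d_model, d_kv, d_ff):
    """Fraction of linear-layer prefill compute saved if prompt tokens
    skip everything except the K/V projections after layer `cut`."""
    if not 0 <= cut <= num_layers:
        raise ValueError("cut must lie between 0 and num_layers")
    full = layer_linear_params(d_model, d_kv, d_ff)
    kv = kv_params(d_model, d_kv)
    baseline = num_layers * full
    reduced = cut * full + (num_layers - cut) * kv
    return 1.0 - reduced / baseline


if __name__ == "__main__":
    # Hypothetical cut points: keep 75% or 50% of the layers for prompt tokens.
    for cut in (24, 16):
        saved = prefill_reduction(32, cut, 4096, 1024, 14336)
        print(f"cut after layer {cut}: about {saved:.1%} of prefill compute saved")
```

Under these assumptions the savings land slightly below the fraction of layers skipped, because the K/V projections for the skipped layers still have to be computed.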
Abstract
LLM inference for enterprise applications, such as summarization, RAG, and code generation, typically observes much longer prompts than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can effectively reduce prefill computation by 25-50% across several LLM…
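The first step in the abstract (filling later layers' KV cache from an earlier layer's output) can be illustrated with a toy model. This is a minimal sketch under strong simplifications, not the SwiftKV implementation: each "layer" is a random matrix with a ReLU residual update standing in for attention plus MLP, every name (`make_layer`, `prefill_swiftkv_style`, `cut`) is hypothetical, and the sketch ignores that the final prompt token presumably still has to pass through all layers to produce the first output token.

```python
import random


def make_layer(rng, dim):
    """One toy layer: random K, V, and 'rest of the layer' weight matrices."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"wk": mat(), "wv": mat(), "wf": mat()}


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def kv_for(layer, hidden_states):
    """Project every token's hidden state to a (key, value) pair."""
    return [(matvec(layer["wk"], h), matvec(layer["wv"], h)) for h in hidden_states]


def layer_forward(layer, h):
    # Stand-in for attention + MLP (no real attention here): residual + ReLU.
    update = matvec(layer["wf"], h)
    return [a + max(0.0, b) for a, b in zip(h, update)]


def prefill_standard(layers, prompt_states):
    """Baseline: every prompt token passes through every layer."""
    cache, full_passes = [], 0
    hs = prompt_states
    for layer in layers:
        cache.append(kv_for(layer, hs))
        hs = [layer_forward(layer, h) for h in hs]
        full_passes += len(hs)
    return cache, full_passes


def prefill_swiftkv_style(layers, prompt_states, cut):
    """Run `cut` layers in full, then fill the remaining layers' KV cache
    from the output of layer `cut` without running those layers."""
    cache, full_passes = [], 0
    hs = prompt_states
    for layer in layers[:cut]:
        cache.append(kv_for(layer, hs))
        hs = [layer_forward(layer, h) for h in hs]
        full_passes += len(hs)
    for layer in layers[cut:]:
        cache.append(kv_for(layer, hs))  # K/V projection only, hs is reused
    return cache, full_passes


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [make_layer(rng, 4) for _ in range(8)]
    prompt = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
    base_cache, base_passes = prefill_standard(layers, prompt)
    fast_cache, fast_passes = prefill_swiftkv_style(layers, prompt, cut=4)
    print("cached layers:", len(base_cache), len(fast_cache))
    print("full layer passes:", base_passes, fast_passes)
```

Both variants produce a KV cache entry for every layer, so decoding can proceed as usual. The entries after the cut are computed from a different hidden state than in the baseline, which is the mismatch the distillation step described in the abstract is meant to compensate for.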
Peer Reviews
Decision · Submitted to ICLR 2025
S1. Inference optimization for Transformer-based LLMs is an important topic which has been extensively studied in recent years. S2. Several key components have been proposed in this paper, with their usefulness showcased in the evaluation. S3. The proposed method is orthogonal to many existing optimizations and they can be used jointly to further optimize the performance.
W1. SingleInputKV borrows observations and ideas from previous works, as stated in the submission (such observation has also been utilized in the InfiniGen paper published at OSDI 2024). W2. A core technique in the proposed method is cross-layer KV cache compression. The comparison/discussion with state-of-the-art KV cache compression/merging/cross-layer works is missing, e.g., PyramidKV and infini-attention. It is encouraged to discuss the difference and the novelty compared to existing KV cac
S1. The paper proposes two techniques derived from insights from prior research, demonstrating their efficacy in reducing computational and memory costs during LLM inference. S2. The authors show that fine-tuning can alleviate the decline in benchmark scores, emphasizing the practicality of the proposed methods without notably sacrificing model performance.
W1. The experiments in the paper are somewhat limited. - The authors evaluate the proposed techniques only on Llama-3.1 models. Testing a wider variety of models would strengthen the results. If the proposed methods could demonstrate their benefits across transformer models with different attention mechanisms (e.g., sparse attention, low-rank attention), scaling approaches (e.g., wide scaling, deep scaling, sparse scaling), and sizes (Llama-3.2-1B, Llama-3.2-3B, Llama-3.2-8B, Llama-3.2-11B, Ll
The ideas for the various optimizations are presented reasonably clearly and they seem novel as well, especially their combination. The evaluation on a number of models and datasets/benchmarks supports their performance claims and a reasonable ablation study is provided as well.
For me, the biggest issue is that end-to-end results are missing, which makes it hard for me to put the presented inference results (throughput, latency) into context, which also makes me question how useful the presented numbers are. * apart from SingleInputKV, all the other optimizations are not properly motivated regarding the reasoning why they should work (some form of microbenchmark) * end to end results are missing, especially since some of their writing, if I am not mistaken, suggests t
Code & Models
- 🤗 Snowflake/Llama-3.1-SwiftKV-8B-Instruct · model · 449 dl · ♡ 8
- 🤗 Snowflake/Llama-3.1-SwiftKV-405B-Instruct-FP8 · model · 6 dl
- 🤗 Snowflake/Llama-3.1-SwiftKV-8B-Instruct-FP8 · model · 79 dl · ♡ 1
- 🤗 Snowflake/Llama-3.3-SwiftKV-70B-Instruct · model · 469 dl · ♡ 2
- 🤗 Snowflake/Llama-3.3-SwiftKV-70B-Instruct-FP8 · model · 1 dl
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · WordPiece · Attention Dropout · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Dropout · Byte Pair Encoding · BERT
