SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
Junming Zhang, Qinyan Zhang, Huajun Sun, Feiyang Gao, Sheng Hu, Rui Nie, Xiangshui Miao

TL;DR
SwiftKV introduces a novel attention algorithm and accelerator design optimized for edge devices, significantly improving speed and efficiency in large language model decoding under resource constraints.
Contribution
The paper presents SwiftKV Attention and SwiftKV-MHA, new algorithms and hardware that enable fast, low-latency multi-head attention on edge accelerators without extensive resource use.
Findings
7.16x speedup over native attention
13.48x latency reduction with SwiftKV-MHA
17.4% increase in generation speed
Abstract
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency single-pass attention inference algorithm, where every (kt, vt) in the KV cache is processed exactly once in a uniform per-token pipeline without score materialization, blockwise softmax, or a second pass, thereby enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which enables high precision attention and low precision GEMV on the same processor array, achieving fast and efficient multi-head parallel…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBig Data and Digital Economy · Advanced Neural Network Applications · Natural Language Processing Techniques
