SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding

Junming Zhang; Qinyan Zhang; Huajun Sun; Feiyang Gao; Sheng Hu; Rui Nie; Xiangshui Miao

arXiv:2601.10953 · cs.AR · January 19, 2026

Open Access

TL;DR

SwiftKV pairs a single-pass attention algorithm with a multi-head accelerator design for resource-constrained edge devices, reporting a 7.16x speedup over native attention and a 13.48x latency reduction with SwiftKV-MHA.

Contribution

The paper presents SwiftKV Attention, a per-token pipelined, single-pass attention algorithm, and SwiftKV-MHA, an accelerator that runs high-precision attention and low-precision GEMV on the same processor array for multi-head decoding on edge hardware.

Findings

01. 7.16x speedup over native attention

02. 13.48x latency reduction with SwiftKV-MHA

03. 17.4% increase in generation speed

Abstract

Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency, single-pass attention inference algorithm in which every (k_t, v_t) pair in the KV cache is processed exactly once in a uniform per-token pipeline, without score materialization, blockwise softmax, or a second pass. This enables fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which enables high-precision attention and low-precision GEMV on the same processor array, achieving fast and efficient multi-head parallel…
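To make the "processed exactly once, no score materialization, no second pass" idea concrete, here is a minimal sketch of one well-known way to get that behavior: the online-softmax recurrence with a running maximum. This is an illustrative assumption, not the paper's stated method; the abstract excerpt does not give SwiftKV's actual update rule. The function names `single_pass_attention` and `two_pass_attention` are hypothetical, and the sketch assumes a non-empty KV cache for a single head.

```python
import math


def single_pass_attention(q, keys, values):
    """Decode-step attention for one head, visiting each (k_t, v_t) once.

    Assumed recurrence (online softmax); not taken from the paper.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    denom = 0.0                   # softmax denominator relative to running_max
    acc = [0.0] * len(values[0])  # weighted value sum, same reference point
    for k_t, v_t in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k_t))
        new_max = max(running_max, score)
        # Rescale the earlier partial sums to the new reference maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * v for a, v in zip(acc, v_t)]
        running_max = new_max
    return [a / denom for a in acc]


def two_pass_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]
```

The single-pass version keeps only three pieces of state (a running maximum, a denominator, and an accumulator), so its working memory does not grow with the cache length, whereas the two-pass reference stores one score per cached token.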

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Natural Language Processing Techniques