VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator

Zhican Wang; Hongxiang Fan; Haroon Waris; Gang Wang; Zhenyu Li; Jianfei Jiang; Yanan Sun; Guanghui He

arXiv:2507.00797·cs.AR·July 2, 2025

TL;DR

This paper introduces VEDA, a hardware accelerator that combines a voting-based KV cache eviction algorithm with a flexible, runtime-reconfigurable dataflow to make large language model inference more efficient on edge devices, reducing latency and hardware complexity.

Contribution

It presents a voting-based KV cache eviction algorithm and a flexible dataflow architecture, enabling efficient LLM inference with reduced latency and hardware complexity.

Findings

01 Latency reduced significantly

02 Hardware complexity decreased from O(N) to O(1)

03 Outperforms existing hardware platforms

Abstract

Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM inference by algorithm-hardware-dataflow tri-optimizations. We propose a novel voting-based KV cache eviction algorithm, balancing hardware efficiency and algorithm accuracy by adaptively identifying unimportant KV vectors. From a dataflow perspective, we introduce a flexible-product dataflow and a runtime reconfigurable PE array for matrix-vector multiplication. The proposed approach effectively handles the diverse dimensional requirements and solves the challenges of incrementally varying sequence lengths. Additionally, an element-serial scheduling scheme is proposed for nonlinear operations, such as softmax and layer normalization (layernorm). Results…
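The abstract names a voting-based KV cache eviction algorithm but does not spell out the voting rule. The sketch below is one hypothetical reading, not the authors' method: at each decoding step, the attention row casts a vote against every cached position whose weight falls below an adaptive threshold (assumed here to be a scaled mean of the row), and when the cache exceeds its budget the position with the most votes is evicted. The names `VotingCache`, `cast_votes`, `evict_index`, `budget`, and `scale` are all illustrative assumptions.

```python
from typing import List, Optional


def cast_votes(attn: List[float], votes: List[int], scale: float = 1.0) -> None:
    """Add one vote to each position whose attention weight is below the threshold.

    The threshold (scale * mean of the row) is an assumed rule, chosen only
    because it adapts to the current sequence length.
    """
    threshold = scale * sum(attn) / len(attn)
    for i, weight in enumerate(attn):
        if weight < threshold:
            votes[i] += 1


def evict_index(votes: List[int]) -> int:
    """Return the index with the most votes; ties go to the oldest position."""
    return max(range(len(votes)), key=lambda i: (votes[i], -i))


class VotingCache:
    """Tracks which token ids remain in a fixed-budget KV cache (hypothetical)."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.token_ids: List[int] = []
        self.votes: List[int] = []

    def step(self, token_id: int, attn: List[float]) -> Optional[int]:
        """Append a token, collect votes from its attention row, evict if over budget.

        `attn` holds one weight per cached position, including the new token.
        Returns the evicted token id, or None if nothing was evicted.
        """
        self.token_ids.append(token_id)
        self.votes.append(0)
        if len(attn) != len(self.token_ids):
            raise ValueError("attention row must cover every cached position")
        cast_votes(attn, self.votes)
        if len(self.token_ids) <= self.budget:
            return None
        victim = evict_index(self.votes)
        self.votes.pop(victim)
        return self.token_ids.pop(victim)


# Usage: a cache that keeps at most three positions.
cache = VotingCache(budget=3)
cache.step(0, [1.0])
cache.step(1, [0.9, 0.1])
cache.step(2, [0.6, 0.1, 0.3])
evicted = cache.step(3, [0.5, 0.05, 0.3, 0.15])
```

In this toy run, token 1 receives low attention at every step, so it accumulates the most votes and is the one dropped when the fourth token arrives. A vote counter of this kind needs only a comparison and an increment per position, which is the sort of hardware-friendly trade-off the abstract alludes to; the actual threshold and eviction policy in VEDA may differ.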

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques