Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding

Arun Ramachandran; Ramaswamy Govindarajan; Murali Annavaram; Prakash Raghavendra; Hossein Entezari Zarch; Lei Gao; Chaoyi Jiang

arXiv:2511.12031·cs.DC·November 18, 2025

TL;DR

This paper introduces BMC, a novel KV cache allocation method that balances memory and compute for LLM inference, significantly improving throughput especially when combined with speculative decoding.

Contribution

The paper proposes BMC, a new KV cache allocation mechanism that reduces overhead and enables effective use of speculative decoding for faster LLM inference.

Findings

01 BMC achieves up to 3.2x throughput acceleration over baseline.

02 Combining BMC with speculative decoding (SD) yields an additional 1.39x speedup.

03 BMC outperforms state-of-the-art inference servers vLLM and DeepSpeed.

Abstract

With the skyrocketing costs of GPUs and their virtual instances in the cloud, there is a significant desire to use CPUs for large language model (LLM) inference. KV cache update, often implemented as allocation, copying, and in-place strided update for each generated token, incurs significant overhead. As the sequence length increases, the allocation and copy overheads dominate the performance. Alternate approaches may allocate large KV tensors upfront to enable in-place updates, but these matrices (with zero-padded rows) cause redundant computations. In this work, we propose a new KV cache allocation mechanism called Balancing Memory and Compute (BMC). BMC allocates, once every r iterations, KV tensors with r redundant rows, allowing in-place update without copy overhead for those iterations, but at the expense of a small amount of redundant computation. Second, we make an interesting…
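
The allocation scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `BlockKVCache`, the helper `simulate`, and the use of NumPy are assumptions made for illustration only. The cache grows by `r` zero-padded rows at a time, so the copy into a new buffer happens once every `r` tokens, and the remaining appends are in-place writes. The padded rows that stay in the buffer stand in for the redundant computation the abstract mentions.

```python
import numpy as np


class BlockKVCache:
    """Hypothetical sketch of a KV buffer that grows r rows at a time."""

    def __init__(self, head_dim, r):
        self.r = r                      # rows added per reallocation
        self.length = 0                 # number of valid (non-padded) rows
        self.buf = np.zeros((0, head_dim), dtype=np.float32)
        self.reallocations = 0          # how many allocate-and-copy events occurred

    def append(self, row):
        if self.length == self.buf.shape[0]:
            # Buffer is full: allocate r extra zero rows and copy once.
            grown = np.zeros(
                (self.buf.shape[0] + self.r, self.buf.shape[1]),
                dtype=self.buf.dtype,
            )
            grown[: self.length] = self.buf[: self.length]
            self.buf = grown
            self.reallocations += 1
        # In-place update: no allocation or copy on this path.
        self.buf[self.length] = row
        self.length += 1

    def valid(self):
        """View of the rows that hold real tokens."""
        return self.buf[: self.length]


def simulate(num_tokens, head_dim=4, r=8):
    """Append num_tokens rows; count padded rows present at each step."""
    cache = BlockKVCache(head_dim, r)
    redundant_rows = 0
    for t in range(num_tokens):
        cache.append(np.full(head_dim, t + 1, dtype=np.float32))
        # Padded rows would take part in a matmul over the full buffer.
        redundant_rows += cache.buf.shape[0] - cache.length
    return cache, redundant_rows


if __name__ == "__main__":
    for r in (1, 8):
        cache, redundant = simulate(20, r=r)
        print(r, cache.reallocations, cache.buf.shape, redundant)
```

With `r=1` the sketch reallocates and copies on every token but carries no padded rows; with a larger `r` it copies less often and carries more padding. That is the compute-versus-copy trade-off in miniature; how `r` should be chosen is not covered by this sketch.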

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications