ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference

Qiuyang Zhang; Kai Zhou; Ding Tang; Kai Lu; Cheng Li; Zhenyu Yang; Peng Xu; Jiguang Wan

arXiv:2603.27138 · cs.LG · March 31, 2026

TL;DR

ScoutAttention is a KV cache offloading framework for large language model inference. It moves the KV cache to CPU memory and combines layer-ahead CPU pre-computation with GPU-CPU collaborative attention, easing the GPU memory limit on long-context decoding.

Contribution

ScoutAttention introduces a layer-ahead CPU pre-computation algorithm and GPU-CPU collaborative block-wise sparse attention to accelerate inference while maintaining accuracy.

Findings

01. Achieves 2.1x speedup over existing offloading methods.

02. Maintains inference accuracy within 2.4% of baseline.

03. Reduces CPU load with collaborative sparse attention.

Abstract

Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation…
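
To make the idea of attention computed jointly over two disjoint sets of KV entries concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not say how partial results are combined, so this sketch assumes the standard log-sum-exp merge used in blockwise attention: each side returns its softmax-weighted output together with the log-sum-exp of its scores, and the two are reweighted into the exact full-attention result. The function names (`partial_attention`, `merge_partials`), the single-query setup, and the "GPU half" / "CPU half" split are all hypothetical and chosen only for illustration.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over a subset of KV entries.

    Returns (output, lse): the softmax-weighted value vector over this
    subset, and the log-sum-exp of the scaled scores for this subset.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, m + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results over disjoint KV subsets."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of subset A
    wb = math.exp(lse_b - m)  # relative softmax mass of subset B
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


# Toy data: 8 KV entries of dimension 4, built deterministically.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(8)]
values = [[math.cos(i * j + 1) for j in range(4)] for i in range(8)]

# Assumption for illustration: the first half of the entries stays on the
# GPU and the second half is offloaded and handled by the CPU.
gpu_out, gpu_lse = partial_attention(q, keys[:4], values[:4])
cpu_out, cpu_lse = partial_attention(q, keys[4:], values[4:])

merged = merge_partials(gpu_out, gpu_lse, cpu_out, cpu_lse)
full, _ = partial_attention(q, keys, values)

max_diff = max(abs(a - b) for a, b in zip(merged, full))
print("merged matches full attention:", max_diff < 1e-9)
```

Because the merge needs only one output vector and one scalar per side, the two partial results can be produced independently and at different times. That property is what makes any scheme in which one device computes its share ahead of the other workable in principle; how ScoutAttention actually schedules and sparsifies this work is described in the paper itself.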
