CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

Dong Liu; Yanxuan Yu

arXiv:2512.11920 · cs.AI · December 16, 2025

TL;DR

CXL-SpecKV introduces a disaggregated FPGA-based key-value cache system utilizing CXL interconnects and speculative execution to significantly improve throughput and reduce memory costs in datacenter LLM serving.

Contribution

It presents a novel architecture combining CXL memory disaggregation, FPGA acceleration, and speculative prefetching for efficient LLM cache management.

Findings

01. Achieves up to 3.2× higher throughput over GPU-only systems.

02. Reduces memory costs by 2.8× while maintaining accuracy.

03. Uses FPGA-accelerated compression to cut memory bandwidth by 4×.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose CXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an…
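
To make the memory pressure described in the abstract concrete, the following is a minimal back-of-the-envelope sketch of how KV-cache size scales with sequence length and batch size. It uses the standard accounting (one key tensor and one value tensor per layer) and is not taken from the paper. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit elements) and the 40 GiB memory budget are hypothetical values chosen only for illustration, as are the function names.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_batch_size(free_bytes, num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_elem=2):
    """Largest batch whose KV cache fits in free_bytes at a fixed seq_len."""
    per_sequence = kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                                  seq_len, 1, bytes_per_elem)
    return free_bytes // per_sequence


# Hypothetical model shape and memory budget (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
GIB = 2 ** 30

per_sequence = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                              seq_len=4096, batch_size=1)
print(f"KV cache per 4096-token sequence: {per_sequence / GIB:.1f} GiB")
print("Max batch in 40 GiB:",
      max_batch_size(40 * GIB, LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096))
```

Under these assumed numbers a single 4096-token sequence needs 2 GiB of KV cache, so 40 GiB of free accelerator memory caps the batch at 20 sequences. This is the kind of limit that moving KV entries to a separate memory tier is meant to relax.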

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Cloud Computing and Resource Management