CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
Dong Liu, Yanxuan Yu

TL;DR
CXL-SpecKV introduces a disaggregated FPGA-based key-value cache system utilizing CXL interconnects and speculative execution to significantly improve throughput and reduce memory costs in datacenter LLM serving.
Contribution
It presents a novel architecture combining CXL memory disaggregation, FPGA acceleration, and speculative prefetching for efficient LLM cache management.
Findings
Achieves up to 3.2× higher throughput than GPU-only systems.
Reduces memory costs by 2.8× while maintaining accuracy.
Uses FPGA-accelerated compression to cut memory bandwidth by 4×.
Abstract
Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose CXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an…
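To make the memory pressure described in the abstract concrete, here is a minimal sketch of how KV-cache size bounds the batch size on a single GPU. It uses the standard per-token sizing for a transformer decoder (one key and one value vector per layer). The model shape, the 40 GiB of free memory, and the helper names are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 counts one key and one value vector per layer.
    bytes_per_elem=2 assumes FP16/BF16 storage (an assumption, not a paper detail).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(free_bytes: int, seq_len: int, per_token_bytes: int) -> int:
    """Largest number of sequences whose full KV cache fits in free_bytes."""
    return free_bytes // (seq_len * per_token_bytes)


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical budget: 40 GiB of GPU memory left after weights, 4096-token sequences.
batch = max_batch_size(free_bytes=40 * 1024**3, seq_len=4096,
                       per_token_bytes=per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Max batch size at 4096 tokens: {batch}")
```

Under these assumed numbers each token costs 512 KiB of cache, so a 4096-token sequence needs 2 GiB and only 20 sequences fit. Moving part of the cache off the GPU, as the abstract proposes, is one way to lift that ceiling.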
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Cloud Computing and Resource Management
