HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, Jia Rao

TL;DR
HGCA introduces a hybrid CPU-GPU attention mechanism that enables scalable, high-quality long-context LLM inference by combining dense and sparse attention, optimizing resource utilization and maintaining accuracy.
Contribution
The paper presents a hybrid attention approach that combines dense GPU attention over recently generated KV entries with sparse CPU attention over selected, salient KV entries, supporting longer sequences without retraining.
Findings
Achieves higher throughput and scalability on commodity hardware.
Supports longer context lengths and larger batch sizes.
Outperforms existing sparse attention methods in accuracy and performance.
Abstract
Scaling inference for large language models (LLMs) is increasingly constrained by limited GPU memory, especially due to growing key-value (KV) caches required for long-context generation. While existing approaches offload KV caches to CPU memory or apply sparse attention to reduce GPU load, they often underutilize CPU compute resources and compromise accuracy. We present HGCA, a hybrid CPU-GPU attention mechanism that enables scalable, high-throughput LLM inference with near-full attention quality. HGCA performs dense attention on recently generated KV entries retained in GPU memory and parallel sparse attention on selected, salient KV entries in CPU memory. The attention outputs are efficiently merged using log-sum-exp fusion, minimizing PCIe transfer overhead. HGCA also introduces a fine-grained, per-head sparsification strategy optimized for CPU execution, preserving contextual…
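The log-sum-exp fusion mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `partial_attention` and `merge_lse`, the single-query, single-head setup, and the way the KV entries are split are all assumptions made for illustration. It shows only the general identity that two softmax-attention results computed over disjoint sets of KV entries can be combined exactly if each side also reports the log-sum-exp of its scores.

```python
import numpy as np

def partial_attention(q, K, V):
    """Softmax attention of one query over a subset of KV entries.

    Returns the attention output and the log-sum-exp (LSE) of the scaled
    scores, which is what a later merge step needs.
    (Hypothetical helper, not taken from the paper.)
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # shape (n,)
    m = scores.max()                        # subtract the max for stability
    w = np.exp(scores - m)
    denom = w.sum()
    out = (w @ V) / denom                   # shape (d_v,)
    lse = m + np.log(denom)
    return out, lse

def merge_lse(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs using their LSE values.

    Each partial output is reweighted by its share of the total softmax
    mass, exp(lse_x - lse_total).
    (Hypothetical helper, not taken from the paper.)
    """
    lse = np.logaddexp(lse_a, lse_b)
    merged = np.exp(lse_a - lse) * out_a + np.exp(lse_b - lse) * out_b
    return merged, lse

# Toy check: one part stands in for the GPU-resident recent entries,
# the other for the CPU-resident entries (the split point is arbitrary).
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 4))

out_gpu, lse_gpu = partial_attention(q, K[6:], V[6:])
out_cpu, lse_cpu = partial_attention(q, K[:6], V[:6])
merged, _ = merge_lse(out_gpu, lse_gpu, out_cpu, lse_cpu)

full, _ = partial_attention(q, K, V)
print(np.allclose(merged, full))
```

The merge is exact over the union of the entries each side attended to. When the CPU side attends only to a sparse selection, as the abstract describes, the merged result equals attention over the recent entries plus that selection, not over the full cache. In this sketch each side has to exchange only its output vector and one scalar per query, which is one plausible reading of how such a fusion keeps transfer volume small.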
