An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Feiyu Yao; Zhixiong Niu; Xiaqing Li; Yongqiang Xiong; Juan Fang; Qian Wang

arXiv:2605.07719·cs.LG·May 11, 2026

TL;DR

Fluxion is a hybrid CPU-GPU sparse attention system for long-context inference over CPU-resident KV caches. It balances accuracy and efficiency, running 1.5× to 3.7× faster than fixed sparse baselines.

Contribution

The paper presents Fluxion, a co-designed system that combines output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution to improve long-context inference over CPU-resident KV caches.

Findings

01. Fluxion achieves 1.5× to 3.7× speedup over fixed sparse baselines.

02. Fluxion maintains minimal accuracy degradation, with a worst average of -0.26 relative to full attention.

03. The system effectively balances accuracy and efficiency across multiple models and tasks.

Abstract

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a…
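
The block-sparse attention and top-k selection mentioned in the abstract can be illustrated with a minimal sketch. This is not Fluxion's implementation: the function name `block_topk_sparse_attention`, the mean-key block scoring heuristic, and the fixed `k_blocks` budget are all assumptions made for illustration, and the code runs on a single device in pure Python. It only shows the general idea of scoring KV blocks cheaply, keeping the top-k blocks, and running softmax attention over the selected tokens.

```python
import math


def block_topk_sparse_attention(q, keys, values, block_size, k_blocks):
    """Single-query attention restricted to the top-k key blocks.

    Hypothetical illustration: blocks are scored by the dot product of the
    query with each block's mean key (an assumed heuristic, not taken from
    the paper), and k_blocks is a fixed budget.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal length")
    if block_size <= 0 or k_blocks <= 0:
        raise ValueError("block_size and k_blocks must be positive")

    d = len(q)
    n = len(keys)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Partition token indices into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Cheap per-block score: query dotted with the block's mean key.
    scores = []
    for block in blocks:
        mean_key = [sum(keys[i][j] for i in block) / len(block) for j in range(d)]
        scores.append(dot(q, mean_key))

    # Top-k block selection, then gather the selected token indices in order.
    top = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:k_blocks]
    selected = sorted(i for b in top for i in blocks[b])

    # Numerically stable softmax attention over the selected tokens only.
    logits = [dot(q, keys[i]) / math.sqrt(d) for i in selected]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dv = len(values[0])
    out = [sum(w * values[i][j] for w, i in zip(weights, selected)) / z
           for j in range(dv)]
    return out, selected


# Usage: four tokens in two blocks; only the best-scoring block is kept.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0]]
out, selected = block_topk_sparse_attention(q, keys, values, block_size=2, k_blocks=1)
print(selected, out)
```

Setting `k_blocks` to the total number of blocks recovers full attention, which makes the sketch easy to check against a dense baseline. In the setting the abstract describes, the KV cache lives in host memory, so the selection step and the gather would also determine how much data crosses PCIe; the sketch does not model that.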
