An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Feiyu Yao, Zhixiong Niu, Xiaqing Li, Yongqiang Xiong, Juan Fang, Qian Wang

TL;DR
Fluxion is a hybrid CPU-GPU sparse attention method for long-context inference over CPU-resident KV caches. It balances accuracy and efficiency, reporting a 1.5× to 3.7× speedup over fixed sparse baselines.
Contribution
The paper presents Fluxion, a co-designed system that combines output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution to improve long-context inference performance.
Findings
Fluxion achieves 1.5× to 3.7× speedup over fixed sparse baselines.
Accuracy degradation is minimal: the worst average difference relative to full attention is -0.26.
The system effectively balances accuracy and efficiency across multiple models and tasks.
Abstract
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a…
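To make the block-sparse attention and top-k selection mentioned in the abstract concrete, here is a minimal NumPy sketch for a single query vector. It is a generic illustration, not Fluxion's algorithm: the function name `block_sparse_attention`, the mean-key block summary, and the fixed `block_size` and `top_k` parameters are all assumptions made for this example. In particular, a fixed `top_k` corresponds to the kind of fixed sparse baseline the paper compares against, not to its output-aware, head-specific budgeting.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Single-query attention restricted to the top_k highest-scoring KV blocks.

    Hypothetical helper for illustration only; not taken from the paper.
    q: (d,), K: (n, d), V: (n, d_v). Assumes n is a multiple of block_size.
    """
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes n is a multiple of block_size"
    n_blocks = n // block_size

    # Summarize each block by its mean key (an assumed, simple proxy).
    summaries = K.reshape(n_blocks, block_size, d).mean(axis=1)  # (n_blocks, d)
    block_scores = summaries @ q                                 # (n_blocks,)

    # Top-k block selection; argpartition avoids a full sort of the scores.
    k = min(top_k, n_blocks)
    selected = np.sort(np.argpartition(block_scores, -k)[-k:])

    # Gather the token indices that belong to the selected blocks.
    token_idx = (selected[:, None] * block_size + np.arange(block_size)).ravel()
    K_sel, V_sel = K[token_idx], V[token_idx]

    # Softmax attention over the selected tokens only.
    logits = K_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel, selected


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out, blocks = block_sparse_attention(q, K, V, block_size=4, top_k=2)
```

With `top_k` equal to the number of blocks, the sketch reduces to full attention. In a CPU-resident setting, the selection and gather steps are where the CPU-side top-k and PCIe transfer costs described in the abstract would arise.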
