HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon

TL;DR
HybridGen is a CPU-GPU hybrid attention framework that improves long-context LLM inference efficiency. It addresses KV cache memory capacity and bandwidth pressure with attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
Contribution
It introduces a hybrid attention approach leveraging tiered memory systems, enabling better utilization of CPU and GPU resources for long-context LLM inference.
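To illustrate why attention can be shared between two processors at all, the following is a minimal sketch of a generic split-and-merge scheme. It is not HybridGen's attention logit parallelism, whose details are not given here. It assumes a single query vector and a KV cache split into two partitions (for instance one held by the GPU and one by the CPU). Each partition produces a partial result, and the partials are merged exactly using their running softmax statistics. All function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over one KV partition.

    Returns (max_logit, sum_of_exps, unnormalized_output) so that
    results from several partitions can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, s, out


def merge_partials(a, b):
    """Combine two partial results into the exact softmax attention output."""
    m_a, s_a, o_a = a
    m_b, s_b, o_b = b
    m = max(m_a, m_b)
    # Rescale each partition to the shared maximum logit.
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    s = c_a * s_a + c_b * s_b
    return [(c_a * x + c_b * y) / s for x, y in zip(o_a, o_b)]


def full_attention(q, keys, values):
    """Reference: attention over the whole KV cache on one device."""
    _, s, out = partial_attention(q, keys, values)
    return [x / s for x in out]
```

Because the merge only needs three small quantities per partition, the two partitions can be processed independently and combined at the end. How the work is divided between them is a separate scheduling question.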
Findings
HybridGen outperforms existing KV cache methods by 1.41x to 3.2x on average.
Experiments on three LLMs and multiple GPU platforms demonstrate significant efficiency gains.
HybridGen maintains high accuracy while improving inference speed.
Abstract
As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these pressures but underutilize hardware: they rely solely on either the GPU or the CPU for attention computation, and they consider only the limited CPU-local memory for KV cache storage. We propose HybridGen, an efficient hybrid attention framework for long-context LLM inference. HybridGen enables CPU-GPU collaborative attention on systems with expanded tiered memory (e.g., CXL memory), addressing three key challenges: (1) multi-dimensional attention dependencies, (2) intensifying CPU-GPU load imbalance with longer sequences, and (3) the NUMA penalty of tiered memory. HybridGen tackles these by introducing attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.…
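The abstract's claim that KV caches reach hundreds of gigabytes can be checked with simple arithmetic. The sketch below uses the standard size formula for a dense transformer KV cache. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption and is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a dense transformer."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                          seq_len=1_000_000)
    print(f"{size / 2**30:.1f} GiB for one 1M-token sequence")
```

Under these assumptions each token costs 512 KiB of cache, so a single one-million-token sequence needs roughly 488 GiB, well beyond the memory of a single GPU.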
