APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

TL;DR
APEX introduces a dynamic, profiling-informed scheduling strategy that enhances CPU-GPU parallelism for large language model inference, significantly improving throughput while maintaining latency on constrained GPU hardware.
Contribution
The paper presents APEX, a novel scheduler that predicts execution times to optimize CPU-GPU task overlap during hybrid LLM inference, outperforming existing static and heuristic methods.
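To make the idea of prediction-driven overlap concrete, here is a minimal sketch in Python. It is not APEX's actual algorithm, which this summary does not describe. It assumes a toy cost model in which a decode step either runs on the GPU alone or runs GPU and CPU work concurrently, with the step ending when the slower side finishes plus a fixed synchronization cost. Every name in it (StepProfile, gpu_step_ms, cpu_step_ms, sync_ms, choose_strategy) is hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    """Hypothetical profiled timings for one decode step (assumed, not from the paper)."""
    gpu_batch: int      # requests whose KV cache fits on the GPU
    cpu_batch: int      # extra requests whose attention would run on the CPU
    gpu_step_ms: float  # predicted GPU time for one decode step
    cpu_step_ms: float  # predicted CPU attention time for the offloaded requests
    sync_ms: float      # assumed fixed cost of synchronizing CPU and GPU results


def gpu_only_throughput(p: StepProfile) -> float:
    """Tokens per millisecond when only GPU-resident requests are served."""
    return p.gpu_batch / p.gpu_step_ms


def hybrid_throughput(p: StepProfile) -> float:
    """Tokens per millisecond when CPU and GPU work overlap.

    With full overlap the step lasts as long as the slower side,
    plus the synchronization cost.
    """
    step_ms = max(p.gpu_step_ms, p.cpu_step_ms) + p.sync_ms
    return (p.gpu_batch + p.cpu_batch) / step_ms


def choose_strategy(p: StepProfile) -> str:
    """Pick the strategy with the higher predicted throughput."""
    if hybrid_throughput(p) > gpu_only_throughput(p):
        return "hybrid"
    return "gpu_only"


if __name__ == "__main__":
    balanced = StepProfile(gpu_batch=8, cpu_batch=4,
                           gpu_step_ms=20.0, cpu_step_ms=18.0, sync_ms=1.0)
    cpu_bound = StepProfile(gpu_batch=8, cpu_batch=2,
                            gpu_step_ms=20.0, cpu_step_ms=60.0, sync_ms=1.0)
    print(choose_strategy(balanced))
    print(choose_strategy(cpu_bound))
```

Under this toy model, offloading pays off only when the predicted CPU time is close to or below the GPU time. A slow CPU side stretches every step and can lower overall throughput, which is why a scheduler of this kind needs execution-time predictions rather than a fixed rule.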
Findings
APEX improves throughput by up to 96% on T4 GPUs.
APEX achieves up to 89% higher throughput on A10 GPUs.
APEX preserves latency while delivering these throughput gains.
Abstract
Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution by offloading KV cache management and parts of attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, Chain-of-Thought reasoning) which are currently underserved by existing systems, especially under memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
