APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs

Jiakun Fan; Yanglin Zhang; Xiangchen Li; Dimitrios S. Nikolopoulos

arXiv:2506.03296 · cs.DC · January 16, 2026

Open Access

TL;DR

APEX introduces a dynamic, profiling-informed scheduling strategy that enhances CPU-GPU parallelism for large language model inference, significantly improving throughput while maintaining latency on constrained GPU hardware.

Contribution

The paper presents APEX, a novel scheduler that predicts execution times in order to maximize the overlap of CPU and GPU tasks during hybrid LLM inference, outperforming existing static and heuristic methods.
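To make the idea of prediction-driven overlap concrete, here is a minimal Python sketch. It is not the paper's algorithm: the `Profile` cost model, its three coefficients, and the greedy `split_requests` policy are all hypothetical and chosen only for illustration. It assumes that the cost of attention in a decode step grows linearly with the number of cached tokens, and it offloads requests to the CPU only while the predicted CPU time stays within the predicted GPU time, so the CPU work can be hidden behind the GPU step.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical linear cost model, assumed to come from offline profiling."""
    gpu_fixed_ms: float         # GPU time per decode step outside attention
    gpu_attn_ms_per_tok: float  # GPU attention time per cached token
    cpu_attn_ms_per_tok: float  # CPU attention time per cached token


def split_requests(context_lens, profile):
    """Return (gpu_requests, cpu_requests) as lists of request indices.

    Greedy illustration only: move requests to the CPU, shortest context
    first, while the predicted CPU time does not exceed the predicted GPU time.
    """
    cpu_reqs = []
    # Shortest contexts are the cheapest to serve on the CPU, so try them first.
    for i in sorted(range(len(context_lens)), key=lambda k: context_lens[k]):
        trial_cpu = cpu_reqs + [i]
        cpu_tokens = sum(context_lens[j] for j in trial_cpu)
        gpu_tokens = sum(context_lens) - cpu_tokens
        cpu_ms = cpu_tokens * profile.cpu_attn_ms_per_tok
        gpu_ms = profile.gpu_fixed_ms + gpu_tokens * profile.gpu_attn_ms_per_tok
        if cpu_ms > gpu_ms:
            break  # CPU work would no longer be hidden behind the GPU step
        cpu_reqs = trial_cpu
    gpu_reqs = [j for j in range(len(context_lens)) if j not in cpu_reqs]
    return gpu_reqs, cpu_reqs


if __name__ == "__main__":
    profile = Profile(gpu_fixed_ms=10.0, gpu_attn_ms_per_tok=0.01, cpu_attn_ms_per_tok=0.05)
    print(split_requests([100, 200, 400], profile))
```

With these made-up coefficients, only the shortest request is offloaded, because adding the second one would push the predicted CPU time past the predicted GPU time. A real scheduler would also have to account for data transfer, batching effects, and prediction error, none of which this sketch models.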

Findings

01. APEX improves throughput by up to 96% on T4 GPUs.

02. APEX improves throughput by up to 89% on A10 GPUs.

03. APEX achieves these throughput gains while maintaining latency.

Abstract

Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution by offloading KV cache management and parts of attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, Chain-of-Thought reasoning) which are currently underserved by existing systems, especially under memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems