DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
Shouxu Lin, Zhiyuan Guo, Jiaxin Lin

TL;DR
DAK is a direct-access GPU memory offloading framework: the GPU reads offloaded weights and KV caches straight from remote memory instead of prefetching them into local HBM, which improves LLM inference efficiency over prefetching-based offloading.
Contribution
The paper proposes an end-to-end direct-access offloading framework, with algorithms for choosing offloading ratios and for congestion control, that uses the Tensor Memory Accelerator (TMA) to improve bandwidth utilization.
Findings
DAK achieves up to 3× performance gains on NVLink-C2C.
DAK attains 1.8× performance improvements on PCIe systems.
DAK outperforms existing memory offloading baselines across diverse architectures.
Abstract
LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloading frameworks rely on prefetching data into local GPU HBM. This approach underutilizes system resources by introducing HBM contention, squandering memory capacity, and creating pipeline bubbles. We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. We propose DAK, an end-to-end direct-access memory offloading framework that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory (SMEM). To maximize remote access performance, DAK introduces a greedy algorithm to determine optimal per-operation offloading…
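The claim about aggregate system bandwidth can be illustrated with a small back-of-the-envelope model. The sketch below is not DAK's algorithm; it assumes an idealized bandwidth-bound operation whose data is split between local HBM and remote memory, with both tiers streamed concurrently and no contention. The function names and the bandwidth figures (3000 GB/s for HBM, 1000 GB/s for the remote link) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Idealized model (an assumption, not the paper's method): an operation reads
# size_gb of data, a fraction `ratio` of which lives in remote memory. The GPU
# streams both tiers at the same time, so the slower stream sets the runtime.

def direct_access_time(size_gb, ratio, hbm_bw, link_bw):
    """Seconds to read size_gb when `ratio` of it is offloaded (bandwidths in GB/s)."""
    local_time = (1.0 - ratio) * size_gb / hbm_bw
    remote_time = ratio * size_gb / link_bw
    return max(local_time, remote_time)


def balanced_ratio(hbm_bw, link_bw):
    """Offloading ratio at which both streams finish together in this model."""
    return link_bw / (hbm_bw + link_bw)


def effective_bandwidth(size_gb, ratio, hbm_bw, link_bw):
    """Achieved GB/s for the whole operation at a given offloading ratio."""
    return size_gb / direct_access_time(size_gb, ratio, hbm_bw, link_bw)


if __name__ == "__main__":
    HBM_BW = 3000.0   # hypothetical local HBM bandwidth, GB/s
    LINK_BW = 1000.0  # hypothetical remote-link bandwidth, GB/s
    for ratio in (0.0, balanced_ratio(HBM_BW, LINK_BW), 0.5):
        bw = effective_bandwidth(1.0, ratio, HBM_BW, LINK_BW)
        print(f"ratio={ratio:.2f} effective bandwidth={bw:.0f} GB/s")
```

In this toy model the balanced ratio makes both streams finish at the same time, so the effective bandwidth equals the sum of the two tiers' bandwidths. Offloading less leaves the remote link idle, and offloading more makes the link the bottleneck. A real system also has to account for contention and per-operation behavior, which this sketch ignores.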
