Efficient LLM Inference with Activation Checkpointing and Hybrid Caching

Sanghyeon Lee; Hongbeen Kim; Soojin Hwang; Guseul Heo; Minwoo Noh; Jaehyuk Huh

arXiv:2501.01792 · cs.DC · February 3, 2026

TL;DR

This paper introduces HybridServe, an LLM inference system that uses activation checkpointing and hybrid caching to significantly improve throughput by efficiently managing memory and computation during model offloading.

Contribution

The paper proposes a novel hybrid caching scheme combining activation and KV caches, enabling faster recomputation and improved GPU utilization during LLM inference.

Findings

01. Achieves 2.19x throughput improvement over prior methods.

02. Effectively balances activation recomputation and parameter loading.

03. Reduces inference time by optimizing cache management.
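The balance named in the second finding can be pictured with a toy cost model. The sketch below is an illustration only, not the paper's algorithm: it assumes that recomputing KV entries from activation checkpoints is "free" as long as it finishes while the next layer's parameters are still being loaded over the host-GPU link. The function names and all numbers (layer size, link bandwidth, per-token recompute cost) are hypothetical.

```python
# Toy cost model (hypothetical, not HybridServe's actual policy):
# how many cached tokens can be kept as activation checkpoints so that
# their KV recomputation hides behind one layer's parameter load?

def tokens_hidden_behind_load(layer_bytes, link_bytes_per_s, recompute_us_per_token):
    """Tokens whose KV can be recomputed while one layer's weights load."""
    # Integer microseconds avoid floating-point rounding surprises.
    load_us = layer_bytes * 1_000_000 // link_bytes_per_s
    return load_us // recompute_us_per_token

def split_cache(total_tokens, hidden_tokens):
    """Keep as many tokens as activations as the load can hide; the rest as KV."""
    as_activation = min(total_tokens, hidden_tokens)
    return as_activation, total_tokens - as_activation

hidden = tokens_hidden_behind_load(
    layer_bytes=1_000_000_000,        # assumed 1 GB of weights per layer
    link_bytes_per_s=25_000_000_000,  # assumed 25 GB/s host-GPU link
    recompute_us_per_token=10,        # assumed 10 microseconds per token
)
activation_tokens, kv_tokens = split_cache(6000, hidden)
print(hidden, activation_tokens, kv_tokens)
```

Under these assumed numbers the load takes 40,000 microseconds, so up to 4,000 tokens could be recomputed in its shadow; any tokens beyond that would be cheaper to keep as ready-made KV entries.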

Abstract

Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, extensive research has focused on expanding GPU memory by leveraging host memory. However, LLM inference engines that use host memory often underutilize GPU compute units, because a considerable portion of inference time is spent loading the model onto the GPU over the host-GPU interconnect. To tackle these challenges of host memory offloading for LLMs, we introduce HybridServe, an LLM inference system with activation checkpointing based on activation caching. The activation cache stores activation checkpoints generated during intermediate inference stages, allowing the fast recomputation of KV cache while model parameters are…
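To make the idea of recomputing the KV cache from activation checkpoints concrete, here is a minimal sketch. It assumes a standard attention layer in which keys and values are linear projections of the layer's input activations, and it leaves out normalization, biases, positional encoding, and multiple heads. The function names, weights, and sizes are hypothetical and are not taken from HybridServe.

```python
# Toy illustration: rebuild one layer's K and V from cached activations.
# All names and sizes are hypothetical; this is not HybridServe's code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def recompute_kv(activations, w_k, w_v):
    """Recompute the K and V rows of cached tokens from their layer inputs."""
    return matmul(activations, w_k), matmul(activations, w_v)

# Two cached tokens with hidden size 2 (the activation checkpoint).
activations = [[1.0, 2.0], [3.0, 4.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]  # identity, so K equals the activations
w_v = [[2.0, 0.0], [0.0, 2.0]]  # doubles every element

k, v = recompute_kv(activations, w_k, w_v)

# Per token, the checkpoint stores d values while K plus V store 2 * d.
checkpoint_values = sum(len(row) for row in activations)
kv_values = sum(len(row) for row in k) + sum(len(row) for row in v)
print(k, v, checkpoint_values, kv_values)
```

In this simplified setting the checkpoint holds half as many values as the K and V entries it can regenerate, which is one reason storing activations in place of KV entries can reduce the data that has to move between host and GPU, at the price of extra projection work on the GPU.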
