ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp

Xinhang Chen; Chao Zhang; Jiahuan He; Wei Liu; Jianming Zhang; Wenlong Zhou; Xiao Li; Pai Zeng; Shiyong Li; Yuanpan Qian; Dong Li; Zhaogeng Li

arXiv:2512.10576 · cs.DC · December 12, 2025

TL;DR

This paper introduces ESS, an offload-centric cache management system for DeepSeek-V3.2-Exp. By offloading Latent-Cache to CPU memory, it frees GPU memory for larger batches and raises Decode-stage throughput in long-context inference.

Contribution

The paper proposes ESS, an offload-centric architecture that selectively offloads Latent-Cache to CPU memory while keeping latency-critical components on the GPU, so that long-context inference throughput is no longer capped by GPU memory.

Findings

01. 69.4% throughput improvement at 32K context length

02. 123% throughput improvement at 128K context length

03. Effective decoupling of batch-size scaling from GPU memory constraints

Abstract

DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved greatly, the Decode stage under PD (Prefill-Decode) disaggregation remains a major bottleneck. This bottleneck stems primarily from the conflict between the linear growth of the Latent-Cache with sequence length and the limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads Latent-Cache to CPU memory while preserving latency-critical components on GPU. By freeing up GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby…
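
The batch-size argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name, the free-memory budget, the per-token cache size, and the GPU-resident fraction are all hypothetical values chosen only to show how a cache that grows linearly with sequence length caps the Decode batch size, and how keeping only part of it on the GPU lifts that cap.

```python
import math


def max_decode_batch(free_gpu_bytes, seq_len, cache_bytes_per_token,
                     gpu_resident_fraction=1.0):
    """Largest batch whose GPU-resident cache fits in free_gpu_bytes.

    All parameters are illustrative assumptions, not figures from the paper.
    gpu_resident_fraction is the share of each sequence's cache kept on the
    GPU; the remainder is assumed to live in CPU memory.
    """
    if not 0.0 < gpu_resident_fraction <= 1.0:
        raise ValueError("gpu_resident_fraction must be in (0, 1]")
    # Cache per sequence grows linearly with sequence length.
    per_sequence_bytes = seq_len * cache_bytes_per_token * gpu_resident_fraction
    return math.floor(free_gpu_bytes / per_sequence_bytes)


# Hypothetical budget: 40 GiB free for cache, 64 KiB of cache per token.
FREE = 40 * 2**30
PER_TOKEN = 64 * 1024

print(max_decode_batch(FREE, 32 * 1024, PER_TOKEN))         # 20
print(max_decode_batch(FREE, 128 * 1024, PER_TOKEN))        # 5
print(max_decode_batch(FREE, 128 * 1024, PER_TOKEN, 0.25))  # 20
```

Under these made-up numbers, quadrupling the context length cuts the feasible batch from 20 to 5, and keeping a quarter of the cache on the GPU restores it to 20. The model ignores transfer latency between CPU and GPU, which is the cost an offload design has to hide.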

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy