Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

Hao Liu; Ye Huang; Chenghuan Huang; Zhenyi Zheng; Jiangsu Du; Ziyang Ma; Jing Lyu; Yutong Lu

arXiv:2604.04451·cs.CV·April 7, 2026

TL;DR

Chorus is a caching method that accelerates video diffusion transformer (DiT) model serving by reusing computation across requests, achieving up to 45% speedup on industrial 4-step distilled models.

Contribution

The paper introduces a three-stage inter-request caching strategy, combined with Token-Guided Attention, to improve efficiency and semantic alignment in video diffusion models.

Findings

01. Achieves up to 45% speedup on industrial models.
02. Effective reuse of latent features across requests.
03. Enhances semantic alignment with token-guided attention.

Abstract

Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Specifically, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention…
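
To make the Stage 1 idea concrete, the following is a minimal sketch of an inter-request lookup, not the Chorus implementation. Every name here (`InterRequestCache`, `CachedRequest`, `plan_steps`), the use of cosine similarity over a prompt embedding, the 0.9 threshold, and the choice to reuse exactly one step are assumptions made for illustration; the abstract states only that latent features from similar requests are fully reused and that the target models use 4 denoising steps.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CachedRequest:
    embedding: list  # hypothetical prompt embedding of an earlier request
    latent: list     # hypothetical latent saved from that request's early steps


@dataclass
class InterRequestCache:
    threshold: float = 0.9  # assumed similarity cutoff, not taken from the paper
    entries: list = field(default_factory=list)

    def add(self, embedding, latent):
        self.entries.append(CachedRequest(list(embedding), list(latent)))

    def lookup(self, embedding):
        """Return the most similar cached request, or None if none reaches the threshold."""
        best, best_sim = None, self.threshold
        for entry in self.entries:
            sim = cosine(embedding, entry.embedding)
            if sim >= best_sim:
                best, best_sim = entry, sim
        return best


def plan_steps(hit, total_steps=4, reused_steps=1):
    """Label each denoising step as reused from the cache or computed (assumed split)."""
    if not hit:
        return ["compute"] * total_steps
    return ["reuse"] * reused_steps + ["compute"] * (total_steps - reused_steps)


cache = InterRequestCache()
cache.add([1.0, 0.0], [0.5, 0.5])        # an earlier request and its saved latent
similar = cache.lookup([0.99, 0.05])     # close to the cached prompt embedding
unrelated = cache.lookup([0.0, 1.0])     # orthogonal to the cached prompt embedding
```

In this sketch a hit lets the new request start from the cached latent and skip the first step, while a miss falls back to computing all four steps. Region-level reuse in Stage 2 and Token-Guided Attention are not modeled here.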
