Accelerating Parallel Diffusion Model Serving with Residual Compression

Jiajun Luo; Yicheng Xiao; Jianru Xu; Yangxiu You; Rongwei Lu; Chen Tang; Jingyan Jiang; Zhi Wang

arXiv:2507.17511·cs.CV·December 1, 2025

TL;DR

CompactFusion is a novel compression framework that reduces communication overhead in parallel diffusion model inference by transmitting only compressed residuals, enabling faster and higher-quality image and video generation.

Contribution

It introduces Residual Compression for diffusion activations, effectively removing redundancy and supporting scalable, communication-efficient parallel inference without pipeline rework.

Findings

01 Achieves 3.0x speedup on 4xL20 with high fidelity

02 Supports sequence parallelism on slow networks, reaching 6.7x speedup

03 Maintains high generation quality while significantly reducing data transfer

Abstract

Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise…
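
The idea of sending only residuals between adjacent steps can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `ResidualChannel` class, the `compress_topk` helper, and the choice of top-k sparsification as the compressor are hypothetical stand-ins, not the compressor or API used by CompactFusion. The sketch only shows the general pattern: both sides keep the same reference activation, the sender transmits a compressed difference from that reference, and both sides update the reference with what was actually sent.

```python
from typing import List, Tuple

# A sparse payload: (index, value) pairs. Hypothetical wire format.
Sparse = List[Tuple[int, float]]


def compress_topk(residual: List[float], k: int) -> Sparse:
    """Keep the k largest-magnitude entries (illustrative compressor only)."""
    order = sorted(range(len(residual)),
                   key=lambda i: abs(residual[i]), reverse=True)
    return [(i, residual[i]) for i in sorted(order[:k])]


class ResidualChannel:
    """Holds the reference activation shared by sender and receiver."""

    def __init__(self, size: int) -> None:
        self.reference = [0.0] * size

    def encode(self, activation: List[float], k: int) -> Sparse:
        # Sender side: compress the difference from the shared reference.
        residual = [a - r for a, r in zip(activation, self.reference)]
        payload = compress_topk(residual, k)
        # Update with the compressed payload (not the exact activation)
        # so the sender's reference matches what the receiver rebuilds.
        self.apply(payload)
        return payload

    def apply(self, payload: Sparse) -> List[float]:
        # Receiver side: add the decoded residual onto the reference.
        for i, v in payload:
            self.reference[i] += v
        return list(self.reference)


sender, receiver = ResidualChannel(4), ResidualChannel(4)

# First step: nothing is shared yet, so the full activation is sent.
first = sender.encode([1.0, 2.0, 3.0, 4.0], k=4)
rebuilt_first = receiver.apply(first)

# Next step: only one entry changed, so one (index, value) pair suffices.
second = sender.encode([1.0, 2.0, 3.5, 4.0], k=1)
rebuilt_second = receiver.apply(second)
```

Because adjacent steps are similar, the residual has few large entries, and a lossy compressor discards less information on it than it would on the raw activation. Updating the sender's reference with the compressed payload means any entry dropped in one step remains in the next step's residual rather than being lost.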

Taxonomy

Topics: Fault Detection and Control Systems