Accelerating Parallel Diffusion Model Serving with Residual Compression
Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu, Chen Tang, Jingyan Jiang, Zhi Wang

TL;DR
CompactFusion is a novel compression framework that reduces communication overhead in parallel diffusion model inference by transmitting only compressed residuals, enabling faster and higher-quality image and video generation.
Contribution
It introduces Residual Compression for diffusion activations, effectively removing redundancy and supporting scalable, communication-efficient parallel inference without pipeline rework.
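The idea of sending only compressed residuals can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the `ResidualChannel` class, the `quantize`/`dequantize` helpers, and the choice of uniform integer quantization as the compressor are all assumptions made here for the example. The sender transmits the quantized difference between the current activation and a reference copy, and both sides update that reference with the same decoded residual so they stay in sync.

```python
from typing import List, Tuple


def quantize(values: List[float], levels: int = 255) -> Tuple[List[int], float]:
    """Uniformly quantize values to signed integer codes (assumed compressor)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    half = levels // 2
    scale = max_abs / half if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


class ResidualChannel:
    """Hypothetical one-way link between two devices (names are illustrative)."""

    def __init__(self, size: int) -> None:
        # Both ends start from the same all-zero reference activation.
        self.sender_ref = [0.0] * size
        self.receiver_ref = [0.0] * size

    def send(self, activation: List[float]) -> Tuple[List[int], float]:
        # Residual: what changed relative to what the receiver already holds.
        residual = [a - r for a, r in zip(activation, self.sender_ref)]
        codes, scale = quantize(residual)
        # Apply the same lossy update the receiver will apply, so the two
        # references remain identical and quantization error does not drift.
        decoded = dequantize(codes, scale)
        self.sender_ref = [r + d for r, d in zip(self.sender_ref, decoded)]
        return codes, scale

    def receive(self, codes: List[int], scale: float) -> List[float]:
        decoded = dequantize(codes, scale)
        self.receiver_ref = [r + d for r, d in zip(self.receiver_ref, decoded)]
        return list(self.receiver_ref)


# Two adjacent "denoising steps" with nearly identical activations.
step1 = [1.0, -2.0, 0.5, 4.0]
step2 = [1.01, -2.02, 0.49, 4.03]

channel = ResidualChannel(size=4)
codes1, scale1 = channel.send(step1)
recon1 = channel.receive(codes1, scale1)
codes2, scale2 = channel.send(step2)
recon2 = channel.receive(codes2, scale2)

err1 = max(abs(a - b) for a, b in zip(step1, recon1))
err2 = max(abs(a - b) for a, b in zip(step2, recon2))
```

Because the second residual is much smaller than the first activation, the same number of quantization levels covers a narrower range, so the second step is reconstructed with a smaller error. That is the intuition behind exploiting step-to-step redundancy; the compressor used in CompactFusion itself may differ.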
Findings
Achieves a 3.0x speedup on 4xL20 GPUs while maintaining high fidelity
Supports sequence parallelism on slow networks, reaching 6.7x speedup
Maintains high generation quality while significantly reducing data transfer
Abstract
Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise…
Taxonomy
Topics: Fault Detection and Control Systems
