CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
Bin Ma, Xingjian Ding, Tekin Bicer, Pengfei Su, Dong Li

TL;DR
CoCoDiff is a distributed inference engine for diffusion transformers (DiTs) that reduces the all-to-all communication latency of Ulysses sequence parallelism by overlapping communication with computation, exploiting tensor redundancy across adjacent denoising steps, and aligning collective operations with the interconnect topology.
Contribution
It introduces three mechanisms, TAPA (Tile-Aware Parallel All-to-all), V-First scheduling, and V-Major, that optimize collective communication and its overlap with computation in distributed DiT inference.
Findings
Achieves an average speedup of 3.6x on the Aurora supercomputer.
Reaches a peak speedup of 8.4x in an evaluation of four DiT models on 1-8 nodes.
Effectively overlaps communication with computation, reducing latency.
Abstract
Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine exploiting two observations: (1) V requires only linear projection while Q/K need additional normalization and RoPE, creating opportunities to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases;…
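To make the all-to-all pattern concrete, here is a minimal single-process sketch of the data redistribution that Ulysses sequence parallelism performs before attention: each rank starts with a slice of the sequence for all heads and ends with the full sequence for a slice of the heads. The names `ulysses_all_to_all` and `make_shards` are hypothetical, the ranks are simulated with plain Python lists, and nothing here reflects CoCoDiff's actual implementation or its TAPA decomposition.

```python
def make_shards(world, seq_len, num_heads):
    """Build sequence-sharded input: rank r holds seq_len // world tokens, all heads.

    Each element is a (token_index, head_index) tuple so the layout is easy to inspect.
    """
    per_rank = seq_len // world
    return [
        [[(r * per_rank + t, h) for h in range(num_heads)] for t in range(per_rank)]
        for r in range(world)
    ]


def ulysses_all_to_all(shards, num_heads):
    """Simulate the exchange: sequence-sharded -> head-sharded.

    shards[r][t][h] is what rank r holds for its local token t and head h.
    Returns out[r][t][h_local]: the full sequence for rank r's slice of heads.
    """
    world = len(shards)
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world
    out = []
    for dst in range(world):
        h0 = dst * heads_per_rank
        full_seq = []
        # Every source rank sends dst the head slice [h0, h0 + heads_per_rank)
        # of its local tokens; concatenating in rank order restores token order.
        for src in range(world):
            for token in shards[src]:
                full_seq.append(token[h0:h0 + heads_per_rank])
        out.append(full_seq)
    return out


shards = make_shards(world=2, seq_len=4, num_heads=4)
out = ulysses_all_to_all(shards, num_heads=4)
print(out[0])  # rank 0: all 4 tokens, heads 0 and 1
print(out[1])  # rank 1: all 4 tokens, heads 2 and 3
```

In a real deployment every rank sends a distinct block to every other rank at each attention layer, once each for Q, K, and V, which is why these collectives recur so often and why their cost depends on the interconnect between ranks.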
