CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

Bin Ma; Xingjian Ding; Tekin Bicer; Pengfei Su; Dong Li

arXiv:2604.14561 · cs.DC · April 22, 2026

TL;DR

CoCoDiff is a distributed inference engine for diffusion transformers (DiTs) that reduces communication latency under Ulysses sequence parallelism by overlapping communication with computation, exploiting the temporal redundancy of tensors across adjacent denoising steps, and optimizing all-to-all collectives.

Contribution

CoCoDiff introduces three mechanisms (TAPA, V-First scheduling, and V-Major) that optimize all-to-all collectives and their overlap with computation in distributed DiT inference.
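
The paper's abstract observes that V needs only a linear projection, while Q and K also need normalization and RoPE, so V's communication can overlap with Q/K computation. The sketch below illustrates that ordering only. It is not CoCoDiff's implementation: `project`, `qk_postprocess`, `fake_all_to_all`, and `v_first_step` are hypothetical stand-ins, the "collective" is an identity copy run on a background thread, and a real engine would presumably use asynchronous collectives instead.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def project(x, w):
    # Stand-in for a linear projection: scale every element by w.
    return [w * v for v in x]


def qk_postprocess(x, angle=0.1):
    # Stand-in for the extra Q/K work: RMS normalization, then a
    # RoPE-like rotation of consecutive pairs by a fixed angle.
    rms = math.sqrt(sum(v * v for v in x) / len(x)) + 1e-6
    normed = [v / rms for v in x]
    out = []
    for a, b in zip(normed[0::2], normed[1::2]):
        out.append(a * math.cos(angle) - b * math.sin(angle))
        out.append(a * math.sin(angle) + b * math.cos(angle))
    return out


def fake_all_to_all(x):
    # Hypothetical placeholder for the all-to-all collective (identity copy).
    return list(x)


def v_first_step(x, pool, events):
    # V is ready right after its projection, so its communication starts first.
    v = project(x, 0.5)
    events.append("launch V all-to-all")
    v_future = pool.submit(fake_all_to_all, v)

    # While V is "in flight", do the heavier Q/K computation.
    q = qk_postprocess(project(x, 2.0))
    k = qk_postprocess(project(x, 3.0))
    events.append("Q/K compute done")

    q_out = fake_all_to_all(q)
    k_out = fake_all_to_all(k)
    v_out = v_future.result()  # join the background communication
    events.append("V all-to-all joined")
    return q_out, k_out, v_out


if __name__ == "__main__":
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        q, k, v = v_first_step([1.0, 2.0, 3.0, 4.0], pool, log)
    print(log)
```

The event log records the order in which the main thread issues work: V's transfer is launched before the Q/K computation begins and is joined only afterwards. That ordering is what gives the transfer a window in which to be hidden.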

Findings

01. Achieves an average speedup of 3.6x on the Aurora supercomputer.

02. Reaches a peak speedup of 8.4x in an evaluation of four DiT models on 1-8 nodes.

03. Overlaps communication with computation, which reduces latency.

Abstract

Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine exploiting two observations: (1) V requires only linear projection while Q/K need additional normalization and RoPE, creating opportunities to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases;…
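
To make the all-to-all in Ulysses sequence parallelism concrete, here is a minimal single-process sketch of the data redistribution it performs before attention. Each rank starts with a slice of the sequence for all heads and ends with the full sequence for a slice of the heads. The function names (`make_shards`, `ulysses_all_to_all`) and the nested-list layout are illustrative assumptions, not code from CoCoDiff or any library. No real communication takes place.

```python
def make_shards(world, seq_len, num_heads):
    # Hypothetical helper: rank r holds tokens [r*chunk, (r+1)*chunk) for ALL heads.
    # Each element is a (token, head) tag so the layout is easy to inspect.
    chunk = seq_len // world
    return [
        [[(t, h) for h in range(num_heads)] for t in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]


def ulysses_all_to_all(shards):
    # Simulated all-to-all: input shards[rank][local_token][head],
    # output out[rank][global_token][local_head].
    world = len(shards)
    num_heads = len(shards[0][0])
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world

    out = []
    for dst in range(world):
        h0 = dst * heads_per_rank
        full_seq = []
        # Gathering chunks in rank order restores the original token order.
        for src in range(world):
            for token in shards[src]:
                full_seq.append(token[h0:h0 + heads_per_rank])
        out.append(full_seq)
    return out


if __name__ == "__main__":
    shards = make_shards(world=2, seq_len=4, num_heads=4)
    for rank, data in enumerate(ulysses_all_to_all(shards)):
        print(rank, data)
```

Every rank exchanges a block with every other rank, and this happens around each attention layer, which is why the abstract describes these collectives as frequent and latency-dominating.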
