TL;DR
PipeFusion introduces a patch-level pipeline parallelism technique for diffusion transformers, significantly reducing inference latency and communication costs while improving memory efficiency on multi-GPU setups.
Contribution
It proposes a novel patch-level pipeline parallel strategy that reuses stale feature maps, outperforming existing DiT inference parallelism methods.
Findings
Achieves state-of-the-art performance on multiple diffusion models.
Reduces communication costs compared to tensor and sequence parallelism.
Enhances memory efficiency for large diffusion transformer models.
Abstract
This paper presents PipeFusion, a parallel methodology that tackles the high latency of generating high-resolution images with diffusion transformer (DiT) models. PipeFusion partitions images into patches and distributes the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs from successive diffusion steps, PipeFusion reuses one-step stale feature maps to provide context for the current pipeline step. This approach notably reduces communication costs compared to existing DiT inference parallelism methods, including tensor parallelism, sequence parallelism, and DistriFusion. PipeFusion also improves memory efficiency by distributing parameters across devices, which makes it well suited to large DiTs such as Flux.1. Experimental results demonstrate that…
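The scheduling idea in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the function names, the one-patch-per-tick timing model, and the rule that patches already processed in the current diffusion step count as fresh are all assumptions made for illustration. It shows which pipeline stage (GPU) would work on which patch at each tick, and which patches would supply current versus one-step stale features as context.

```python
def pipeline_schedule(num_stages, num_patches):
    """Return, for each clock tick, the (stage, patch) pairs active in that tick.

    Hypothetical timing model: every stage takes one tick per patch, and
    stage s can start patch p one tick after stage s - 1 finished it.
    """
    ticks = []
    for t in range(num_stages + num_patches - 1):
        active = [
            (s, t - s)
            for s in range(num_stages)
            if 0 <= t - s < num_patches
        ]
        ticks.append(active)
    return ticks


def context_freshness(num_patches, current_patch):
    """Label the source of each patch's features while `current_patch` runs.

    Assumption for illustration: patches up to and including the current one
    have been updated in this diffusion step ("fresh"); the rest are reused
    from the previous step ("stale").
    """
    return [
        "fresh" if p <= current_patch else "stale"
        for p in range(num_patches)
    ]


if __name__ == "__main__":
    for t, active in enumerate(pipeline_schedule(num_stages=2, num_patches=4)):
        print(f"tick {t}: {active}")
    print(context_freshness(num_patches=4, current_patch=1))
```

In this toy model each stage hands only one patch's activations to the next stage per tick, which is the intuition behind the lower communication cost claimed above; the real system's costs depend on details this sketch does not capture.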
