StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

Jiacheng Yang; Jun Wu; Yaoyao Ding; Zhiying Xu; Yida Wang; Gennady Pekhimenko

arXiv:2601.20273·cs.DC·January 29, 2026

Open Access

TL;DR

StreamFusion introduces a topology-aware, efficient sequence parallelism framework for distributed diffusion transformer inference on GPUs, overcoming communication bottlenecks and synchronization overheads to significantly improve performance.

Contribution

It proposes a novel topology-aware sequence parallelism method, Torus Attention, and a one-sided communication approach, enhancing distributed inference efficiency for diffusion transformers.

Findings

01. Achieves up to 1.77x speedup over state-of-the-art methods.

02. Reduces communication and synchronization overheads in distributed GPU inference.

03. Demonstrates improved scalability for high-resolution image and video generation.

Abstract

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a…
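
For readers unfamiliar with the sequence parallelism baselines named in the abstract, the following is a minimal single-process sketch of the general idea behind Ring Attention. The sequence is split into shards, each simulated device keeps its query shard, and key/value shards are passed around a ring while an online softmax accumulates the result. It is not StreamFusion's Torus Attention and is not taken from the paper. The function names (`full_attention`, `ring_attention`), the NumPy simulation of devices as list entries, and the single-head, unmasked setting are all assumptions made for illustration.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference single-device softmax attention (single head, no mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def ring_attention(q, k, v, num_devices):
    """Simulate Ring Attention in one process (hypothetical helper).

    Each simulated device keeps its query shard and, at every step,
    attends to the key/value shard it currently holds before passing
    that shard to the next rank in the ring.
    """
    d = q.shape[-1]
    q_shards = np.array_split(q, num_devices)
    kv_shards = list(zip(np.array_split(k, num_devices),
                         np.array_split(v, num_devices)))

    # Per-device running state for the online (streaming) softmax.
    run_max = [np.full((qs.shape[0], 1), -np.inf) for qs in q_shards]
    run_sum = [np.zeros((qs.shape[0], 1)) for qs in q_shards]
    run_out = [np.zeros((qs.shape[0], v.shape[-1])) for qs in q_shards]

    for _ in range(num_devices):
        for rank in range(num_devices):
            k_blk, v_blk = kv_shards[rank]
            scores = q_shards[rank] @ k_blk.T / np.sqrt(d)
            new_max = np.maximum(run_max[rank],
                                 scores.max(axis=-1, keepdims=True))
            # Rescale previously accumulated values to the new maximum.
            scale = np.exp(run_max[rank] - new_max)
            probs = np.exp(scores - new_max)
            run_sum[rank] = run_sum[rank] * scale + probs.sum(axis=-1, keepdims=True)
            run_out[rank] = run_out[rank] * scale + probs @ v_blk
            run_max[rank] = new_max
        # Ring step: every device hands its current K/V shard to the next rank.
        kv_shards = kv_shards[-1:] + kv_shards[:-1]

    # Normalise each shard's output and reassemble the full sequence.
    return np.concatenate([o / s for o, s in zip(run_out, run_sum)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 16, 8))
    print(np.allclose(ring_attention(q, k, v, num_devices=4),
                      full_attention(q, k, v)))
```

In a real multi-GPU deployment the ring step would be a point-to-point send and receive between neighbouring devices, typically overlapped with the attention computation. The sketch only shows why the sharded result matches full attention.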

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques