ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
Han Meng (University of California, Merced), Danny Willow Liu (University of Chicago), Dong Li (University of California, Merced, Yotta Labs)

TL;DR
ChunkFlow is a runtime for layerwise offloading in distributed diffusion transformer inference that co-schedules weight prefetching with inter-GPU communication, improving inference speed and reducing peak GPU memory.
Contribution
ChunkFlow introduces a co-scheduling model and a communication-aware offloading runtime that adaptively trades GPU memory for prefetch volume, mitigating PCIe contention between prefetch and collective communication.
Findings
Up to 1.28x speedup over existing methods.
Reduces peak GPU memory by up to 49%.
Achieves near-zero overhead in small workloads.
Abstract
Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU…
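The hiding condition described in the abstract can be illustrated with a rough back-of-the-envelope check. This is a minimal sketch, not the paper's analytical model: it assumes prefetch time is simply layer size divided by PCIe bandwidth, and it models contention as a fixed fraction of PCIe bandwidth taken by collectives. The names `LayerProfile`, `exposed_prefetch_s`, and `comm_share` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerProfile:
    """Hypothetical per-layer measurements (not taken from the paper)."""
    weight_bytes: float  # bytes that must be prefetched for the next layer
    compute_s: float     # per-layer GPU compute time, in seconds


def exposed_prefetch_s(layer: LayerProfile,
                       pcie_bytes_per_s: float,
                       comm_share: float = 0.0) -> float:
    """Return the prefetch time that compute cannot hide, in seconds.

    Assumption: collectives take a fixed fraction `comm_share` of the
    PCIe bandwidth, and prefetch gets whatever remains.
    """
    if not 0.0 <= comm_share < 1.0:
        raise ValueError("comm_share must be in [0, 1)")
    effective_bw = pcie_bytes_per_s * (1.0 - comm_share)
    prefetch_s = layer.weight_bytes / effective_bw
    # Prefetch is hidden only up to the length of the compute window.
    return max(prefetch_s - layer.compute_s, 0.0)


# Illustrative numbers only: a 400 MB layer over a 25 GB/s PCIe link.
large = LayerProfile(weight_bytes=400e6, compute_s=0.020)
small = LayerProfile(weight_bytes=400e6, compute_s=0.005)

hidden = exposed_prefetch_s(large, 25e9)                      # compute covers prefetch
contended = exposed_prefetch_s(large, 25e9, comm_share=0.5)   # contention exposes latency
small_load = exposed_prefetch_s(small, 25e9)                  # small workload exposes latency
```

Under these assumed numbers, the same layer is fully hidden when compute is long and the link is uncontended, but becomes exposed either when collectives share the PCIe path or when the per-GPU compute workload shrinks, which are the two failure cases the abstract names.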
