ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

Han Meng (University of California, Merced), Danny Willow Liu (University of Chicago), Dong Li (University of California, Merced, Yotta Labs)

arXiv:2605.11335·cs.DC·May 13, 2026

TL;DR

ChunkFlow is a runtime for layerwise offloading in distributed diffusion transformer (DiT) inference that co-schedules weight prefetching with inter-GPU collective communication. The reported results are up to 1.28x speedup over existing methods and up to 49% lower peak GPU memory.

Contribution

ChunkFlow introduces a co-scheduling model for prefetch and communication, together with a communication-aware offloading runtime that adaptively trades GPU memory for prefetch volume to mitigate PCIe contention between prefetch and collective communication.

Findings

1. Up to 1.28x speedup over existing methods.
2. Reduces peak GPU memory by up to 49%.
3. Achieves near-zero overhead in small workloads.

Abstract

Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU…
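As a rough illustration of the first-order reasoning the abstract describes, the sketch below checks whether one layer's prefetch fits inside one layer's compute time, and estimates what fraction of a layer would have to stay resident on the GPU for the rest to be hidden. It is not the paper's model: the function names `exposed_prefetch_s` and `min_resident_fraction`, the single effective PCIe bandwidth parameter, and the example numbers are all assumptions made for illustration.

```python
def exposed_prefetch_s(weight_bytes: float, compute_s: float,
                       pcie_bytes_per_s: float) -> float:
    """Prefetch time that is NOT hidden behind per-layer compute.

    Assumption: the next layer's weights are copied host-to-GPU at one
    effective bandwidth while the current layer computes.
    """
    prefetch_s = weight_bytes / pcie_bytes_per_s
    return max(prefetch_s - compute_s, 0.0)


def min_resident_fraction(weight_bytes: float, compute_s: float,
                          pcie_bytes_per_s: float) -> float:
    """Fraction of a layer to keep on the GPU so the rest can be hidden.

    Assumption: keeping a fraction f resident shrinks the prefetch volume
    to (1 - f) * weight_bytes, trading GPU memory for less PCIe traffic.
    """
    if weight_bytes <= 0:
        return 0.0
    hideable_bytes = compute_s * pcie_bytes_per_s
    fraction = 1.0 - hideable_bytes / weight_bytes
    return min(max(fraction, 0.0), 1.0)


# Hypothetical numbers: 2 GB per layer, 50 ms of compute, 20 GB/s effective PCIe.
print(exposed_prefetch_s(2e9, 0.05, 20e9))
print(min_resident_fraction(2e9, 0.05, 20e9))
```

In this simplified view, contention from collective traffic on the same PCIe path could be approximated by lowering `pcie_bytes_per_s`, which raises both the exposed time and the resident fraction needed to hide the prefetch.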
