KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training
Kai Zhang, Peng Wang, Sai Bi, Jianming Zhang, Yuanjun Xiong

TL;DR
KnapFormer is an online load-balancing framework for distributed training of Diffusion Transformers. It redistributes tokens across GPUs to correct token imbalance, which yields reported speedups of 2x to 3x and fewer stragglers.
Contribution
It introduces a global knapsack-based token redistribution method integrated with sequence parallelism for efficient Diffusion Transformer training.
Findings
Achieves less than 1% workload discrepancy across GPUs.
Realizes 2x to 3x speedup in training diffusion models.
Reduces straggler effects in distributed training.
Abstract
We present KnapFormer, an efficient and versatile framework that combines workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and then solving a global knapsack problem. The solver aims to minimize the variance of the total per-GPU workload while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process and utilizing a simple semi-empirical workload…
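The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the paper's solver: it swaps in a greedy longest-first heuristic, and the names `workload` and `balance`, the FLOP constants, the default hidden size, and the way `gamma` scales the attention term are all assumptions made for illustration. Each "bag" stands for a group of GPUs that shares a sequence through sequence parallelism, so a sequence placed in a bag of `g` GPUs is assumed to add `cost / g` to each GPU in that bag.

```python
def workload(seq_len, hidden=1024, gamma=1.0):
    """Hypothetical per-layer cost estimate for one sequence.

    Assumption: a common back-of-envelope FLOP count, with a linear term
    for projections/MLP and a quadratic term for attention. `gamma` is an
    empirical correction on the attention term (illustrative only).
    """
    linear = 24 * seq_len * hidden ** 2
    attention = 4 * seq_len ** 2 * hidden
    return linear + gamma * attention


def balance(seq_lens, bag_sizes, gamma=1.0):
    """Greedy longest-first assignment of sequences to compute bags.

    seq_lens  : sequence lengths gathered from all ranks
    bag_sizes : number of GPUs in each bag (sequence-parallel group)
    Returns (assignment, loads): sequence indices per bag, and the
    estimated per-GPU load of each bag.
    """
    loads = [0.0] * len(bag_sizes)
    assignment = [[] for _ in bag_sizes]
    # Place the most expensive sequences first (cost grows with length).
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for i in order:
        cost = workload(seq_lens[i], gamma=gamma)
        # Pick the bag whose per-GPU load stays lowest after adding this sequence.
        best = min(range(len(bag_sizes)),
                   key=lambda b: loads[b] + cost / bag_sizes[b])
        loads[best] += cost / bag_sizes[best]
        assignment[best].append(i)
    return assignment, loads


if __name__ == "__main__":
    lens = [8192, 4096, 512, 512, 256, 128, 77]  # made-up token counts
    assignment, loads = balance(lens, bag_sizes=[1, 1, 2])
    for bag, (idx, load) in enumerate(zip(assignment, loads)):
        print(f"bag {bag}: sequences {idx}, per-GPU load {load:.3e}")
```

A greedy pass like this only approximates the variance-minimizing objective the abstract describes. It is shown here because it makes the inputs (gathered lengths, bag sizes, a cost model) and the output (a sequence-to-bag assignment) concrete.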
Peer Reviews
Decision·Submitted to ICLR 2026
- **Practical problem & clear motivation.** Token / visual-token heterogeneity is real in multimodal diffusion training; addressing stragglers is valuable for throughput and cost. The paper identifies an important engineering bottleneck and proposes an end-to-end solution.
- **Simple, usable design.** The compute-bag abstraction and compact topology spec (e.g., g1n32+g2n16...) are intuitive and likely easy to plug into existing PyTorch/DeepSpeed pipelines; API snippets strengthen reproduci
- **Simplifying assumptions in workload model.** The latency model is FLOP-based with an empirical γ correction. While practical, it is hardware and kernel specific; the paper fits γ for H100 only and does not show sensitivity to γ, batch size, or kernel implementations. If γ changes (different GPU, FlashAttention version, different head/channel layouts), the knapsack decisions could be suboptimal. There is little robustness analysis.
- **Algorithmic / theoretical gaps.** The assignment use
1. Well-motivated system problem: The introduction clearly articulates token-length heterogeneity as a bottleneck in DiT training, connecting it to variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training.
2. Methodological clarity: The paper provides explicit pseudocode-style descriptions and API examples, showing integration points both outside and inside transformer blocks. This improves reproducibility and readability.
3. Minimal commun
1. Empirical evaluation limited to synthetic workloads: Section 4.1 explicitly notes that results are from “a training simulator for text-to-{image, video} diffusion training”. While realistic distributions are simulated, no actual end-to-end training curves (e.g., loss vs. steps) or validation throughput on real datasets are provided. This limits evidence that the gains translate to actual pretraining pipelines.
2. Missing baseline comparisons with concurrent work: In Related Work (L160-L200),
- Proposes an online redistribution of tokens to reduce stragglers when multimodal inputs create large sequence-length variance across GPUs.
- By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process, it achieves minimal communication overhead and less than 1% workload discrepancy.
- Claimed performance improvements depend on fast intra-node links (NVLink) and a particular parallelism layout. It seems the payoff is infrastructure-dependent, thus less scien
- It benefits most when fast intra-node links (e.g., NVLink) are available since cross-node bandwidth/latency can erode the gains.
- It does not look like a good fit to ICLR. A load-balancing/throughput optimizer tied to a specific training stack reads as systems engineering, which typically fits MLSys/OSDI/EuroSys/SOSP better than ICLR.
- Claimed results depend on fast intra-node links (NVLink) and a particular parallelism layout.
- If improvements come purely from token routing and communic
Taxonomy
Topics: Neural Networks and Applications
