TL;DR
SPRINT is a token-dropping method for diffusion transformers that fuses dense shallow-layer outputs with sparse deep-layer outputs, preserving image quality while cutting training cost by 9.8x on ImageNet-1K 256x256 and nearly halving inference FLOPs.
Contribution
The paper proposes a simple token-dropping technique that combines residual fusion of shallow and deep layer outputs with a two-stage training schedule to improve the efficiency of diffusion transformers.
Findings
Achieves 9.8x training savings on ImageNet-1K 256x256.
Nearly halves FLOPs during inference with Path-Drop Guidance.
Maintains comparable image quality metrics.
Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning…
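The data flow described in the abstract can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function name `sprint_forward`, the toy blocks `add_one`, `double`, and `identity`, the random choice of kept tokens, the zero fill for dropped positions, and the element-wise addition used for fusion are all assumptions made for the example.

```python
import random

def sprint_forward(tokens, shallow, deep, decoder, drop_ratio=0.75, seed=0):
    """Hypothetical sparse-dense residual forward pass over a list of token vectors."""
    dense = shallow(tokens)                       # shallow path: sees every token
    n, dim = len(dense), len(dense[0])
    n_keep = max(1, round(n * (1.0 - drop_ratio)))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(n), n_keep))   # assumption: uniform random subset
    sparse = deep([dense[i] for i in keep])       # deep path: kept tokens only
    # Scatter deep outputs back to full length. Dropped positions get zeros here,
    # which is an assumption; a real model might use a learned placeholder instead.
    padded = [[0.0] * dim for _ in range(n)]
    for i, vec in zip(keep, sparse):
        padded[i] = list(vec)
    # Residual fusion (assumed to be element-wise addition for this sketch).
    fused = [[d + p for d, p in zip(dv, pv)] for dv, pv in zip(dense, padded)]
    return decoder(fused), keep

# Toy stand-ins for the transformer blocks (hypothetical, for illustration only).
def add_one(seq):
    return [[x + 1.0 for x in v] for v in seq]

def double(seq):
    return [[2.0 * x for x in v] for v in seq]

def identity(seq):
    return [list(v) for v in seq]

tokens = [[float(i), float(i)] for i in range(8)]
out, keep = sprint_forward(tokens, add_one, double, identity)
```

With `drop_ratio=0.75` the deep blocks see a quarter of the tokens, so the pairwise attention work in each deep block falls to roughly one sixteenth of the full-sequence amount. Setting `drop_ratio=0.0` passes every token through both paths, which corresponds to the full-token fine-tuning stage in this sketch.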
Peer Reviews
Decision·ICLR 2026 Poster
(1) SPRINT achieves 9.8× training speedup on ImageNet-1K while maintaining comparable quality. The method adds only 0.3% parameters and preserves standard DiT blocks, making it easy to integrate. Strong generalization across architectures (SiT, U-ViT, REPA) demonstrates practical value. (2) Path-Drop Guidance (PDG) halves inference FLOPs while improving quality. Comprehensive experiments reveal complementary roles of sparse-deep and dense-shallow features, providing valuable insights into DiT repr
(1) The paper claims that two-stage training can "close the train-inference gap," but does not quantify how large this gap actually is.
1. Practical Problem: The paper addresses the important issue of quadratic training costs in DiTs, which is highly relevant for the community. 2. Strong Empirical Results: The reported 9.8x training speedup with maintained quality is impressive if valid. 3. Architecture Agnostic: The method appears to work across different architectures (SiT, UViT) and can be combined with other techniques like REPA. 4. Comprehensive Experiments: The paper includes extensive ablations and analysis across multipl
Major Concerns 1. Limited Technical Novelty. The core contribution appears to be a modification of MDTv2, essentially replacing the side-interpolator with simple residual connections. The encoder-middle-decoder architecture is questionable, and the paper fails to provide compelling theoretical or empirical justification for why this specific design should outperform existing methods like MDTv2. 2. Insufficient Comparison with Prior Work. The paper does not adequately explain why SPRINT should b
+ Good performance. + The proposed dense shallow path and sparse deep path effectively accelerate training.
1. More discussion on Path-Drop Guidance should be included in the Introduction. Currently, the manuscript treats it as merely a supplementary design. 2. The font size in the tables should be consistent.
