TL;DR
Heterogeneous Step Allocation (HSA) is a training-free inference method for diffusion transformers that gives each spatiotemporal token its own denoising-step budget based on its velocity dynamics, cutting computation while preserving quality under tight budgets.
Contribution
The paper introduces HSA, a novel inference algorithm that allocates steps heterogeneously to tokens, reducing computation in diffusion video generation without offline profiling.
Findings
HSA outperforms previous caching methods and baseline models at aggressive acceleration levels.
HSA maintains structural integrity and quality under tight computational budgets.
HSA achieves a better quality-runtime trade-off without additional offline profiling.
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that…
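The idea of giving different tokens different step budgets can be illustrated with a small sketch. This is not the paper's allocation rule, which the excerpt above does not specify: the per-token `motion` score, the linear mapping from score to budget, the even spacing of active steps, and the names `allocate_steps` and `active_schedule` are all assumptions made for illustration only.

```python
def allocate_steps(motion, min_steps, max_steps):
    """Map per-token motion scores to integer step budgets.

    Assumption: a linear map from the normalized score to
    [min_steps, max_steps]; the paper's actual rule may differ.
    """
    lo, hi = min(motion), max(motion)
    span = hi - lo
    budgets = []
    for m in motion:
        # If all scores are equal, give every token the full budget.
        frac = (m - lo) / span if span > 0 else 1.0
        budgets.append(min_steps + round(frac * (max_steps - min_steps)))
    return budgets


def active_schedule(budgets, total_steps):
    """Return, for each global step, the token indices updated at that step.

    Assumption: a token with budget b is active on b steps spread
    evenly across the global schedule, always including step 0.
    """
    if max(budgets) > total_steps:
        raise ValueError("a token budget exceeds the global step count")
    schedule = [[] for _ in range(total_steps)]
    for tok, b in enumerate(budgets):
        for k in range(b):
            schedule[(k * total_steps) // b].append(tok)
    return schedule


# Toy example: three tokens with low, medium, and high motion.
motion = [0.0, 0.5, 1.0]
total_steps = 8
budgets = allocate_steps(motion, min_steps=2, max_steps=total_steps)
schedule = active_schedule(budgets, total_steps)

# Token evaluations under the heterogeneous schedule vs. a uniform one.
hetero_evals = sum(len(active) for active in schedule)
uniform_evals = len(motion) * total_steps
print(budgets, hetero_evals, uniform_evals)
```

In this toy setting, the low-motion token is evaluated on only two of the eight steps, so the total number of token evaluations drops below the uniform count. In a real DiT, the tokens skipped at a given step would still need to be visible to the active ones, which is the role the abstract assigns to KV-cache synchronization; that part is not modeled here.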
