TL;DR
Astraea is a framework that accelerates video diffusion transformers (vDiTs) by optimizing token selection and attention strategies, reaching up to 2.4× inference speedup on a single GPU and up to 13.2× on 8 GPUs with minimal quality loss.
Contribution
Astraea introduces a lightweight token selection mechanism and a GPU-friendly sparse attention strategy, combined with an evolutionary search that finds near-optimal configurations for vDiT-based video generation.
Findings
Up to 2.4× inference speedup on a single GPU
Up to 13.2× speedup on 8 GPUs
Over 10 dB improvement in video quality compared to state-of-the-art methods
Abstract
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high compute demands pose a major challenge for practical deployment. While studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-friendly sparse attention strategy, enabling linear savings on execution time with minimal impact on generation quality. Meanwhile, to determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the…
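The search described in the abstract, an evolutionary algorithm that decides how much token reduction to apply at each timestep under a performance target, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the discrete keep ratios, the `toy_quality` proxy (which simply pretends early timesteps matter more), the budget expressed as a mean keep ratio, and the specific selection, crossover, and mutation operators. A real search would score candidates by generating videos and measuring quality.

```python
import random

CHOICES = (0.25, 0.5, 0.75, 1.0)  # hypothetical per-timestep token keep ratios


def cost(ratios):
    """Relative compute: mean fraction of tokens kept across timesteps."""
    return sum(ratios) / len(ratios)


def toy_quality(ratios):
    """Hypothetical quality proxy (NOT the paper's metric): early timesteps
    get larger weights, with diminishing returns per kept token."""
    n = len(ratios)
    return sum((n - t) * r ** 0.5 for t, r in enumerate(ratios))


def repair(ratios, budget, rng):
    """Lower randomly chosen timesteps one level at a time until the
    candidate fits the compute budget."""
    ratios = list(ratios)
    while cost(ratios) > budget + 1e-9:
        t = rng.randrange(len(ratios))
        i = CHOICES.index(ratios[t])
        if i > 0:
            ratios[t] = CHOICES[i - 1]
    return ratios


def evolve(num_steps=10, budget=0.5, pop_size=20, generations=40, seed=0):
    """Return the best per-timestep keep-ratio schedule found."""
    if budget < min(CHOICES):
        raise ValueError("budget is below the smallest keep ratio")
    rng = random.Random(seed)
    pop = [
        repair([rng.choice(CHOICES) for _ in range(num_steps)], budget, rng)
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        pop.sort(key=toy_quality, reverse=True)
        parents = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_steps)  # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(num_steps)] = rng.choice(CHOICES)  # mutation
            children.append(repair(child, budget, rng))
        pop = parents + children
    return max(pop, key=toy_quality)


if __name__ == "__main__":
    best = evolve()
    print("schedule:", best, "mean keep ratio:", cost(best))
```

The repair step keeps every candidate within the budget, so the search only ever compares feasible schedules. Penalizing infeasible candidates in the fitness function would be an equally reasonable design.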
Peer Reviews
Decision: ICLR 2026 Poster
1. Operating at the token level, it offers finer granularity than step- or block-level methods, addressing a previously underexplored dimension.
2. Evolutionary algorithm effectively allocates token budgets across timesteps, reducing reliance on heuristics.
3. Clear evidence of multi-GPU efficiency, which is crucial for industrial deployment.
1. Evolutionary search may still be computationally expensive, limiting practicality in large-scale or frequently updated deployments.
2. Generality of prompts: Search is conducted on a small set of prompts; broader validation on diverse datasets would strengthen claims of generalization.
- The idea is novel and conceptually makes sense. It cleverly leverages the multi-timestep nature of diffusion models for token selection.
- The method is supported by solid experiments, and the resulting performance is highly impressive.
Since the paper claims to outperform native sparse attention in lines 233 to 241, would it be possible to include a performance comparison with state-of-the-art methods, such as SVG2 [1]?
[1] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
- **Excellent Scalability**: A key strength of this work is its demonstrated scalability. The framework achieves strong performance scaling across multiple GPUs, showing up to 13.2x speedup on 8 GPUs, which highlights its practical utility for large-scale inference.
- **Good Performance Gains**: The proposed method delivers acceptable and noteworthy performance gains, achieving up to 2.4x inference speedup on a single GPU while maintaining high video quality (e.g., <0.5% VBench loss).
- **Clea
- **Evolutionary Algorithm (EA) Search Cost**: The EA search for finding the optimal token distribution is computationally expensive, with an average search time of 82 GPU hours and some models taking up to 139 hours. While the authors rightly point out this is an offline cost, this is a significant practical hurdle. Further work could be done to optimize this search or analyze potential inefficiencies in the EA process.
- **Lack of Theoretical Justification for EA**: The paper's justification f
Taxonomy
Methods: Diffusion
