Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu; Yuge Cheng; Wenxuan Miao; Zihan Liu; Aiyue Chen; Jing Lin; Yiwu Yao; Chen Chen; Jingwen Leng; Yu Feng; Minyi Guo

arXiv:2506.05096·cs.CV·September 29, 2025

TL;DR

Astraea is a framework that accelerates video diffusion transformers by optimizing token selection and attention strategies, achieving significant speedups with minimal quality loss.

Contribution

Astraea introduces a novel token selection and sparse attention method combined with an evolutionary search for optimal configurations in vDiT-based video generation.

Findings

1. Up to 2.4× inference speedup on a single GPU
2. Up to 13.2× speedup on 8 GPUs
3. Over 10 dB improvement in video quality compared to state-of-the-art

Abstract

Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high compute demands pose a major challenge for practical deployment. While studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-friendly sparse attention strategy, enabling linear savings on execution time with minimal impact on generation quality. Meanwhile, to determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the…
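As a rough illustration of the search idea in the abstract, the sketch below runs a small evolutionary loop over per-timestep token keep ratios under an average budget. It is not the authors' implementation: the names `repair`, `toy_quality`, and `evolve`, the quality proxy, the per-step weights, and every hyperparameter are hypothetical stand-ins, since the abstract does not specify the fitness function or the search operators.

```python
import random

LO, HI = 0.1, 1.0  # assumed bounds on the fraction of tokens kept per timestep


def repair(ratios, budget):
    """Clamp ratios to [LO, HI]; if their mean exceeds budget, shrink them toward LO."""
    ratios = [min(HI, max(LO, r)) for r in ratios]
    mean = sum(ratios) / len(ratios)
    if mean > budget:
        # Scale only the part above LO, so the new mean equals the budget
        # and no ratio drops below LO (requires budget >= LO).
        factor = (budget - LO) / (mean - LO)
        ratios = [LO + (r - LO) * factor for r in ratios]
    return ratios


def toy_quality(ratios):
    """Hypothetical quality proxy: earlier timesteps are assumed to matter more."""
    n = len(ratios)
    return sum((n - t) * r ** 0.5 for t, r in enumerate(ratios))


def evolve(num_steps, budget, fitness, pop_size=32, generations=60, seed=0):
    """Search for per-timestep keep ratios that maximize fitness under a mean budget."""
    rng = random.Random(seed)
    pop = [repair([rng.uniform(LO, HI) for _ in range(num_steps)], budget)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elites = pop[: pop_size // 4]  # the best quarter survives unchanged
        children = []
        while len(elites) + len(children) < pop_size:
            a, b = rng.sample(elites, 2)
            # Uniform crossover followed by small Gaussian mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [g + rng.gauss(0.0, 0.05) for g in child]
            children.append(repair(child, budget))
        pop = elites + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = evolve(num_steps=10, budget=0.5, fitness=toy_quality)
    print([round(r, 2) for r in best])
```

In a real setting the fitness call would presumably run the video model and score the output, which is where the offline search cost raised by the reviewers would come from; the toy proxy here only keeps the example self-contained.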

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 2

Strengths

1. Operating at the token level, it offers finer granularity than step- or block-level methods, addressing a previously underexplored dimension.
2. Evolutionary algorithm effectively allocates token budgets across timesteps, reducing reliance on heuristics.
3. Clear evidence of multi-GPU efficiency, which is crucial for industrial deployment.

Weaknesses

1. Evolutionary search may still be computationally expensive, limiting practicality in large-scale or frequently updated deployments.
2. Generality of prompts: Search is conducted on a small set of prompts; broader validation on diverse datasets would strengthen claims of generalization.

Reviewer 02·Rating 6·Confidence 2

Strengths

- The idea is novel and conceptually makes sense. It cleverly leverages the multi-timestep nature of diffusion models for token selection.
- The method is supported by solid experiments, and the resulting performance is highly impressive.

Weaknesses

Since the paper claims to outperform native sparse attention (lines 233 to 241), would it be possible to include a performance comparison with state-of-the-art methods such as SVG2 [1]?

[1] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Reviewer 03·Rating 4·Confidence 3

Strengths

- **Excellent Scalability**: A key strength of this work is its demonstrated scalability. The framework achieves strong performance scaling across multiple GPUs, showing up to 13.2x speedup on 8 GPUs, which highlights its practical utility for large-scale inference.
- **Good Performance Gains**: The proposed method delivers acceptable and noteworthy performance gains, achieving up to 2.4x inference speedup on a single GPU while maintaining high video quality (e.g., <0.5% VBench loss).
- **Clea

Weaknesses

- **Evolutionary Algorithm (EA) Search Cost**: The EA search for finding the optimal token distribution is computationally expensive, with an average search time of 82 GPU hours and some models taking up to 139 hours. While the authors rightly point out this is an offline cost, this is a significant practical hurdle. Further work could be done to optimize this search or analyze potential inefficiencies in the EA process.
- **Lack of Theoretical Justification for EA**: The paper's justification f

Taxonomy

Methods·Diffusion