EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow
Dogyun Park, Yanyu Li, Sergey Tulyakov, Anil Kag

TL;DR
EFlow is an efficient training framework for few-step video diffusion transformers that reduces both training compute and inference time while maintaining high-quality video generation.
Contribution
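The paper presents two components: Gated Local-Global Attention, a token-droppable hybrid attention block, and Path-Drop Guided training, part of an efficient few-step training recipe. Together they enable fast few-step video generators to be trained from scratch.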
Findings
Achieves 2.5x higher training throughput compared to standard solution-flow.
Attains 45.3x lower inference latency than standard iterative models.
Demonstrates competitive performance on Kinetics and large-scale text-to-video datasets.
Abstract
Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to the state at time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block that is efficient, expressive, and highly stable under aggressive random token dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the…
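To make the solution-flow idea concrete, here is a minimal toy sketch of a map that carries a noised state at time t directly to time s, and of few-step sampling as a chain of such jumps. It is not the paper's model: in EFlow the map is a learned video transformer, whereas this sketch swaps in a closed-form map for a hypothetical one-dimensional problem whose "data" is a single point `MU`. The time convention (t = 1 is pure noise, t = 0 is data), the linear noising path, and the names `solution_map` and `sample` are all assumptions made for illustration.

```python
import random

MU = 2.0  # hypothetical "data": a point mass at MU, so the exact map is known


def solution_map(x_t, t, s):
    """Map a noised state at time t to the state at time s (toy, closed form).

    Assumed noising path: x_t = (1 - t) * MU + t * eps, with eps ~ N(0, 1).
    Along this path the flow ODE has the solution
    x_s = MU + (s / t) * (x_t - MU), valid for t > 0.
    """
    return MU + (s / t) * (x_t - MU)


def sample(x_noise, times):
    """Few-step sampling: jump along a decreasing time grid, e.g. [1.0, 0.5, 0.0]."""
    x = x_noise
    for t, s in zip(times[:-1], times[1:]):
        x = solution_map(x, t, s)
    return x


random.seed(0)
x1 = random.gauss(0.0, 1.0)  # start from pure noise at t = 1

one_step = sample(x1, [1.0, 0.0])                     # a single jump
four_step = sample(x1, [1.0, 0.75, 0.5, 0.25, 0.0])   # four smaller jumps

# Consistency of an exact solution map: jumping 1 -> 0.5 -> 0.25
# lands where the direct jump 1 -> 0.25 does.
direct = solution_map(x1, 1.0, 0.25)
chained = solution_map(solution_map(x1, 1.0, 0.5), 0.5, 0.25)
```

Because the toy map is exact, one jump and four jumps reach the same sample. A learned map only approximates this property, which is why the number of sampling steps remains a quality and latency trade-off in practice.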
