FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
Bin Lei, Le Chen, Caiwen Ding

TL;DR
FlashVideo is a RetNet-based framework for text-to-video generation that reduces inference complexity from O(L^2) to O(L) for a sequence of length L, speeding up inference enough to make real-time applications practical.
Contribution
The paper presents the first adaptation of the RetNet architecture to video generation and pairs it with a redundant-free frame interpolation method to speed up inference.
Findings
Achieves a 9.17x efficiency improvement over traditional autoregressive transformer models
Reduces inference complexity from O(L^2) to O(L)
Inference speed comparable to BERT-based transformers
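As a rough illustration of where the O(L^2) to O(L) reduction listed above comes from, the sketch below compares the two equivalent forms of single-head retention from the general RetNet formulation: a parallel form that scores every position against all earlier ones, and a recurrent form that carries a fixed-size state from step to step. This is not FlashVideo's implementation. It omits RetNet's positional rotation, normalization, and multi-scale decay, and the function names, the decay value 0.9, and the toy dimensions are assumptions chosen for illustration.

```python
import math
import random


def retention_parallel(q, k, v, gamma):
    """Parallel form: each position n scores against every m <= n (O(L^2) pairs)."""
    seq_len, d_v = len(q), len(v[0])
    out = []
    for n in range(seq_len):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only earlier or equal positions contribute
            # (q_n . k_m) weighted by the exponential decay gamma^(n - m)
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one fixed-size state update per position (O(L) updates)."""
    d_k, d_v = len(q[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        # S_n = gamma * S_{n-1} + k_n^T v_n
        state = [
            [gamma * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        # o_n = q_n S_n
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs (hypothetical sizes): sequence length 6, head dimension 4.
random.seed(0)
seq_len, dim = 6, 4
q, k, v = (
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    for _ in range(3)
)

par = retention_parallel(q, k, v, 0.9)
rec = retention_recurrent(q, k, v, 0.9)
outputs_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(par, rec)
    for a, b in zip(row_a, row_b)
)
```

Both functions compute the same outputs, but the recurrent form never revisits earlier positions: the cost of each generation step is independent of how many tokens came before, so generating L tokens takes a number of state updates linear in L.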
Abstract
In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from O(L^2) to O(L) for a sequence of length L, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Music and Audio Processing
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
