FlashVideo: A Framework for Swift Inference in Text-to-Video Generation

Bin Lei; le Chen; Caiwen Ding

arXiv:2401.00869 · cs.CV · January 3, 2024 · 1 citation


Open Access

TL;DR

FlashVideo is a RetNet-based framework for text-to-video generation. It reduces the time complexity of inference from O(L^2) to O(L) for a sequence of length L and adopts a redundant-free frame interpolation method, which substantially shortens inference time.

Contribution

The paper presents the first adaptation of the RetNet architecture to video generation, achieving faster inference at lower time complexity, and adopts a redundant-free frame interpolation method.

Findings

01

Achieves a 9.17x efficiency improvement over a traditional autoregressive-based transformer model

02

Reduces inference time complexity from O(L^2) to O(L) for a sequence of length L

03

Inference speed comparable to BERT-based transformers

Abstract

In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $O(L^{2})$ to $O(L)$ for a sequence of length $L$, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of…
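The complexity reduction described in the abstract comes from the retention mechanism of RetNet, which can be evaluated either in a parallel form or in an equivalent recurrent form. The sketch below is not FlashVideo's implementation. It is a simplified single-head retention in plain Python that assumes a scalar decay `gamma` and omits the position rotation, scaling, and normalization used in the full RetNet architecture. The function names, sizes, and random inputs are hypothetical and chosen only for illustration. It shows that the recurrent form, which keeps one fixed-size state and does constant work per token, reproduces the output of the parallel form, which scores every pair of positions.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: scores every causal pair (n, m), so work grows as L^2."""
    L, d_v = len(Q), len(V[0])
    out = []
    for n in range(L):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            # Dot product of query n and key m, decayed by gamma^(n - m).
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: one fixed-size state update per token, so work grows as L."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # state of shape (d_k, d_v)
    out = []
    for q, k, v in zip(Q, K, V):
        # S <- gamma * S + outer(k, v)
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(d_v)]
             for i in range(d_k)]
        # Output for this token is q @ S.
        out.append([sum(q[i] * S[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


# Hypothetical toy sizes: 16 tokens, 8-dimensional queries, keys, and values.
random.seed(0)
L, d = 16, 8
Q, K, V = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
           for _ in range(3))

out_par = retention_parallel(Q, K, V, gamma=0.9)
out_rec = retention_recurrent(Q, K, V, gamma=0.9)

# Largest elementwise gap between the two forms (floating-point noise only).
max_abs_diff = max(abs(a - b)
                   for row_p, row_r in zip(out_par, out_rec)
                   for a, b in zip(row_p, row_r))
```

During autoregressive generation only the recurrent form is needed: each new token updates the state `S` once, so the cost per token does not depend on how many tokens came before, and the total cost over a sequence of length L is linear in L.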


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Music and Audio Processing

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings