FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov

TL;DR
This paper introduces a two-stage latent diffusion architecture for text-to-video generation, improving quality and efficiency through novel temporal conditioning and interpolation methods, and achieves top-tier results among open-source solutions.
Contribution
The paper proposes a new two-stage text-to-video generation pipeline with separate temporal blocks and an efficient interpolation model, advancing the state-of-the-art in open-source video synthesis.
Findings
Separate temporal blocks outperform temporal layers in quality metrics.
The interpolation model reduces computational costs significantly.
The pipeline achieves top-2 overall and top-1 among open-source solutions.
Abstract
Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage synthesizes keyframes to establish the storyline of a video, while the second generates interpolation frames to make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs…
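The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `generate_keyframes`, `interpolate`, and `text_to_video` are hypothetical names, frames are plain lists of floats rather than latents, the frame counts are arbitrary, and simple linear blending stands in for the learned diffusion-based interpolation model. It only shows how sparse keyframes from a first stage are densified by a second stage.

```python
from typing import List

Frame = List[float]  # toy stand-in for a latent or image tensor


def generate_keyframes(prompt: str, num_keyframes: int, size: int = 4) -> List[Frame]:
    """Stage 1 placeholder (hypothetical): sparse keyframes for a prompt.

    A real system would sample these with a text-conditioned diffusion model;
    here each keyframe is a constant vector derived from the prompt.
    """
    seed = sum(ord(c) for c in prompt) % 10
    return [[float(seed + k)] * size for k in range(num_keyframes)]


def interpolate(keyframes: List[Frame], frames_between: int) -> List[Frame]:
    """Stage 2 placeholder (hypothetical): fill in frames between keyframes.

    Linear blending is an assumption used only for illustration; it is not
    the interpolation model described in the paper.
    """
    if not keyframes:
        return []
    video: List[Frame] = []
    for start, end in zip(keyframes, keyframes[1:]):
        video.append(start)
        for i in range(1, frames_between + 1):
            t = i / (frames_between + 1)
            video.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    video.append(keyframes[-1])
    return video


def text_to_video(prompt: str, num_keyframes: int = 4, frames_between: int = 3) -> List[Frame]:
    """Run both stages: keyframes first, then interpolation."""
    keyframes = generate_keyframes(prompt, num_keyframes)
    return interpolate(keyframes, frames_between)
```

With K keyframes and n interpolated frames per adjacent pair, the sketch yields K + (K - 1) * n frames, so only K of them require the more expensive first stage.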
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Diffusion
