FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Vladimir Arkhipkin; Zein Shaheen; Viacheslav Vasilev; Elizaveta Dakhova; Andrey Kuznetsov; Denis Dimitrov

arXiv:2311.13073·cs.CV·December 21, 2023·1 cites

Open Access · 1 Repo · 2 Models

TL;DR

This paper introduces a two-stage latent diffusion architecture for text-to-video generation, improving quality and efficiency through novel temporal conditioning and interpolation methods, and achieves top-tier results among open-source solutions.

Contribution

The paper proposes a new two-stage text-to-video generation pipeline with separate temporal blocks and an efficient interpolation model, advancing the state-of-the-art in open-source video synthesis.

Findings

01. Separate temporal blocks outperform temporal layers in quality metrics.

02. The interpolation model reduces computational costs significantly.

03. The pipeline achieves top-2 overall and top-1 among open-source solutions.

Abstract

Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs…
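
The two-stage structure described in the abstract (keyframes first, then interpolated frames between them) can be sketched as a toy example. This is only an illustration of the control flow under stated assumptions, not the authors' implementation: `generate_keyframes`, `interpolate_pair`, and `two_stage_video` are hypothetical names, the "frames" are short lists of numbers, and a linear blend stands in for the paper's learned interpolation model.

```python
from typing import List

Frame = List[float]  # toy stand-in for an image or latent (assumption)


def generate_keyframes(prompt: str, num_keyframes: int, frame_size: int = 4) -> List[Frame]:
    """Hypothetical stage 1: returns deterministic toy frames.

    A real pipeline would run a text-conditioned diffusion model here;
    the prompt is accepted but unused in this placeholder.
    """
    return [[float(k)] * frame_size for k in range(num_keyframes)]


def interpolate_pair(a: Frame, b: Frame, num_between: int) -> List[Frame]:
    """Hypothetical stage 2: linear blend between two adjacent keyframes.

    This stands in for the learned interpolation model and is NOT the
    method from the paper.
    """
    frames: List[Frame] = []
    for i in range(1, num_between + 1):
        t = i / (num_between + 1)  # blend weight strictly between 0 and 1
        frames.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    return frames


def two_stage_video(prompt: str, num_keyframes: int, num_between: int) -> List[Frame]:
    """Run stage 1, then fill the gap between each adjacent keyframe pair."""
    keyframes = generate_keyframes(prompt, num_keyframes)
    if not keyframes:
        return []
    video: List[Frame] = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(interpolate_pair(a, b, num_between))
    video.append(keyframes[-1])
    return video


if __name__ == "__main__":
    video = two_stage_video("a cat running", num_keyframes=4, num_between=3)
    # K keyframes plus n frames in each of the K - 1 gaps
    print(len(video))
```

With K keyframes and n interpolated frames per gap, the sketch yields K + (K - 1) * n frames in total. The values 4 and 3 in the usage example are arbitrary and are not taken from the paper.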

Code & Models

Repositories

ai-forever/kandinskyvideo (PyTorch, official)

Models

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Diffusion