Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization

Peter Schaldenbrand; Zhixuan Liu; Jean Oh

arXiv:2210.12826·cs.CV·October 25, 2022

TL;DR

This paper presents a fast, CLIP-guided, pixel-level optimization method that generates videos from a series of text descriptions in near real time, at resolutions up to 720p and with arbitrary aspect ratios.

Contribution

It computes the CLIP loss directly at the pixel level instead of optimizing through a computationally heavy image generator model, which makes text-to-video generation fast enough for near real-time systems.

Findings

01 Generates video at 1-2 frames per second at up to 720p resolution

02 Supports arbitrary aspect ratios and variable frame rates

03 Enables near real-time text-guided video synthesis

Abstract

We introduce an approach to generating videos based on a series of given language descriptions. Frames of the video are generated sequentially and optimized by guidance from the CLIP image-text encoder, iterating through the language descriptions and weighting the current description higher than the others. As opposed to optimizing through an image generator model itself, which tends to be computationally heavy, the proposed approach computes the CLIP loss directly at the pixel level, producing general content at a speed suitable for near real-time systems. The approach can generate videos at up to 720p resolution, with variable frame rates and arbitrary aspect ratios, at a rate of 1-2 frames per second. Please visit our website to view videos and access our open-source code: https://pschaldenbrand.github.io/text2video/
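The loop described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The CLIP loss is replaced by a stand-in mean squared distance to a per-description target vector, the 3:1 emphasis on the current description is an illustrative value, and starting each frame from the previous one is an assumption. All function names are hypothetical. What the sketch keeps from the abstract is the structure: frames are produced sequentially, the current description is weighted higher than the others, and the gradient step is applied to the pixels themselves rather than to the parameters of a generator.

```python
from typing import List, Sequence


def description_weights(num_descriptions: int, current: int,
                        emphasis: float = 3.0) -> List[float]:
    """Normalized weights that favor the description at index `current`.

    The emphasis ratio is an illustrative assumption, not a value from the paper.
    """
    if not 0 <= current < num_descriptions:
        raise ValueError("current must index one of the descriptions")
    raw = [emphasis if i == current else 1.0 for i in range(num_descriptions)]
    total = sum(raw)
    return [w / total for w in raw]


def stand_in_grad(pixels: Sequence[float], target: Sequence[float]) -> List[float]:
    """Gradient of a mean squared distance, used here in place of a CLIP loss."""
    n = len(pixels)
    return [2.0 * (p - t) / n for p, t in zip(pixels, target)]


def optimize_frame(pixels: Sequence[float], targets: Sequence[Sequence[float]],
                   weights: Sequence[float], steps: int = 50,
                   lr: float = 0.5) -> List[float]:
    """Gradient descent on the pixel values under the weighted stand-in loss."""
    pixels = list(pixels)
    for _ in range(steps):
        total_grad = [0.0] * len(pixels)
        for target, weight in zip(targets, weights):
            grad = stand_in_grad(pixels, target)
            total_grad = [g + weight * d for g, d in zip(total_grad, grad)]
        # The update touches the pixels directly; there is no generator network.
        pixels = [p - lr * g for p, g in zip(pixels, total_grad)]
    return pixels


def generate_video(targets: Sequence[Sequence[float]],
                   frames_per_description: int = 2) -> List[List[float]]:
    """Produce frames sequentially, moving through the descriptions in order."""
    frame = [0.5] * len(targets[0])  # assumed gray starting frame
    video = []
    for current in range(len(targets)):
        weights = description_weights(len(targets), current)
        for _ in range(frames_per_description):
            # Assumption: each frame is initialized from the previous frame.
            frame = optimize_frame(frame, targets, weights)
            video.append(frame)
    return video


if __name__ == "__main__":
    # Two hypothetical "descriptions", represented by all-dark and all-bright targets.
    targets = [[0.0] * 4, [1.0] * 4]
    for index, frame in enumerate(generate_video(targets)):
        print(index, [round(p, 3) for p in frame])
```

In a real system the stand-in gradient would come from backpropagating a CLIP image-text similarity loss to the frame tensor, which is where nearly all of the per-frame cost lies.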

Code & Models

Repositories

pschaldenbrand/Text2Video (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging

Methods: Contrastive Language-Image Pre-training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings