Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization
Peter Schaldenbrand, Zhixuan Liu, Jean Oh

TL;DR
This paper presents a fast, CLIP-guided, pixel-level optimization method that generates videos from text descriptions in near real time, with support for resolutions up to 720p and flexible aspect ratios.
Contribution
It introduces an efficient approach that computes the CLIP loss directly at the pixel level for near real-time text-to-video generation, rather than optimizing through a computationally heavy image generator model.
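To make "optimizing at the pixel level" concrete, here is a minimal sketch, not the authors' implementation. It assumes the loss is one minus the cosine similarity between an image embedding and a text embedding, and it replaces the CLIP image encoder with a hypothetical fixed random linear map (`proj`) so the gradient can be written by hand with NumPy. The names `encode_image`, `loss_and_grad`, and `optimize_frame`, along with the step size and step count, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 9, 16, 3   # tiny 16:9 frame for illustration; any height/width works
EMBED_DIM = 32

# HYPOTHETICAL stand-in for the CLIP image encoder: a fixed random linear map.
# A real system would call CLIP and obtain gradients through autograd.
proj = rng.standard_normal((EMBED_DIM, H * W * C)) / np.sqrt(H * W * C)

def encode_image(pixels):
    """Map an (H, W, C) pixel array to an embedding vector (stand-in encoder)."""
    return proj @ pixels.ravel()

def loss_and_grad(pixels, text_emb):
    """Return 1 - cos(image_emb, text_emb) and its gradient w.r.t. the pixels."""
    e = encode_image(pixels)
    ne, nt = np.linalg.norm(e), np.linalg.norm(text_emb)
    cos = e @ text_emb / (ne * nt)
    # Derivative of cosine similarity with respect to the image embedding.
    dcos_de = text_emb / (ne * nt) - cos * e / ne**2
    # Chain rule through the linear encoder; negate because loss = 1 - cos.
    grad = -(proj.T @ dcos_de).reshape(pixels.shape)
    return float(1.0 - cos), grad

def optimize_frame(pixels, text_emb, steps=100, lr=2.0):
    """Gradient descent directly on pixel values; no generator network involved."""
    pixels = pixels.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(pixels, text_emb)
        pixels = np.clip(pixels - lr * grad, 0.0, 1.0)  # keep valid pixel range
    return pixels

text_emb = rng.standard_normal(EMBED_DIM)   # placeholder for a CLIP text embedding
frame0 = rng.uniform(0.0, 1.0, (H, W, C))   # start from noise
loss_before, _ = loss_and_grad(frame0, text_emb)
frame1 = optimize_frame(frame0, text_emb)
loss_after, _ = loss_and_grad(frame1, text_emb)
```

The only trainable quantity here is the pixel array itself, which is why the frame shape (and hence the aspect ratio) can be chosen freely in this sketch.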
Findings
Generates 1-2 frames per second at resolutions up to 720p
Supports arbitrary aspect ratios and variable frame rates
Enables near real-time text-guided video synthesis
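As a rough illustration of what these figures imply, the following sketch works out frame dimensions and generation time. It assumes "720p" refers to a frame height of 720 pixels with the width derived from the chosen aspect ratio; the helper names `frame_size` and `generation_seconds` are hypothetical.

```python
def frame_size(height, aspect_w, aspect_h):
    """Return (width, height) for a given frame height and aspect ratio."""
    width = round(height * aspect_w / aspect_h)
    return width, height

def generation_seconds(num_frames, frames_per_second):
    """Time needed to generate num_frames at a given generation rate."""
    return num_frames / frames_per_second

# Assumed example: a 60-frame clip at the reported 1-2 frames per second.
widescreen = frame_size(720, 16, 9)
square = frame_size(720, 1, 1)
slowest = generation_seconds(60, 1)   # at 1 frame per second
fastest = generation_seconds(60, 2)   # at 2 frames per second
```

Under these assumptions, a 60-frame clip would take between 30 and 60 seconds to generate, regardless of the frame rate at which it is later played back.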
Abstract
We introduce an approach to generating videos from a series of given language descriptions. Frames of the video are generated sequentially and optimized with guidance from the CLIP image-text encoder: the method iterates through the language descriptions, weighting the current description higher than the others. Rather than optimizing through an image generator model, which tends to be computationally heavy, the proposed approach computes the CLIP loss directly at the pixel level, capturing the general content of the descriptions at a speed suitable for near real-time systems. The approach can generate videos at up to 720p resolution, with variable frame rates and arbitrary aspect ratios, at a rate of 1-2 frames per second. Please visit our website to view videos and access our open-source code: https://pschaldenbrand.github.io/text2video/
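The description-weighting step can be sketched as follows. The abstract states only that the current description is weighted higher than the others; the specific weight of 0.8, the even split of the remainder, the fixed number of frames per description, and the function names are all assumptions made for illustration.

```python
def prompt_weights(num_prompts, current, current_weight=0.8):
    """Return one weight per description, summing to 1.

    The description at index `current` gets `current_weight`; the remainder
    is split evenly among the others. Both choices are assumptions.
    """
    if num_prompts == 1:
        return [1.0]
    other = (1.0 - current_weight) / (num_prompts - 1)
    return [current_weight if i == current else other
            for i in range(num_prompts)]

def frame_schedule(num_prompts, frames_per_prompt):
    """Yield (frame_index, weights), stepping through descriptions in order."""
    frame = 0
    for current in range(num_prompts):
        for _ in range(frames_per_prompt):
            yield frame, prompt_weights(num_prompts, current)
            frame += 1

def weighted_loss(per_prompt_losses, weights):
    """Combine per-description losses into one scalar for the current frame."""
    return sum(w * l for w, l in zip(weights, per_prompt_losses))

# Three descriptions, two frames each: frames 0-1 favor description 0, etc.
schedule = list(frame_schedule(num_prompts=3, frames_per_prompt=2))
```

In a full system, each frame would presumably start from the previous frame's pixels and be optimized against `weighted_loss`, so that content shifts gradually from one description to the next.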
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
Methods: Contrastive Language-Image Pre-training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
