Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg

TL;DR
This paper introduces FlexTI2V, a training-free method for text-image-to-video generation that flexibly incorporates visual conditions into text-to-video foundation models using a novel patch swapping strategy and dynamic control, outperforming previous training-free image conditioning methods.
Contribution
The paper presents a unified, training-free approach for flexible visual conditioning in text-to-video models, enabling arbitrary image conditioning without finetuning.
Findings
Outperforms previous training-free image conditioning methods.
Generalizes to UNet-based and transformer-based architectures.
Effectively balances creativity and fidelity through dynamic control.
Abstract
Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through…
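The random patch swapping idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is cut off before it gives details, so the function name `random_patch_swap`, the `(frames, channels, height, width)` latent layout, the `swap_ratio` and `patch` parameters, and the choice to swap into a single frame are all assumptions made for illustration. The sketch also assumes the condition image has already been inverted to a noisy latent at the same noise level as the video latent.

```python
import numpy as np


def random_patch_swap(video_latent, cond_latent, frame_idx, swap_ratio,
                      patch=2, rng=None):
    """Hypothetical illustration of random patch swapping (not the paper's code).

    video_latent: array of shape (F, C, H, W), the noisy video latent.
    cond_latent:  array of shape (C, H, W), the inverted (noisy) condition image.
    frame_idx:    index of the frame that receives patches from cond_latent.
    swap_ratio:   fraction of patches in that frame to replace, in [0, 1].
    patch:        side length of a square patch, in latent pixels (assumed).
    Returns the new video latent and the boolean (H, W) swap mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, _, h, w = video_latent.shape
    if h % patch or w % patch:
        raise ValueError("latent height and width must be divisible by patch")
    if not 0.0 <= swap_ratio <= 1.0:
        raise ValueError("swap_ratio must lie in [0, 1]")

    # Pick a random subset of patch positions on the coarse patch grid.
    grid_h, grid_w = h // patch, w // patch
    n_patches = grid_h * grid_w
    n_swap = int(round(swap_ratio * n_patches))
    chosen = rng.choice(n_patches, size=n_swap, replace=False)
    grid = np.zeros(n_patches, dtype=bool)
    grid[chosen] = True

    # Upsample the patch-level selection to a pixel-level (H, W) mask.
    mask = grid.reshape(grid_h, grid_w)
    mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)

    # Copy so the caller's latent is left untouched, then swap selected patches.
    out = video_latent.copy()
    out[frame_idx] = np.where(mask, cond_latent, video_latent[frame_idx])
    return out, mask
```

In a denoising loop, a function like this would be called at each step on the current video latent. How the swap ratio varies across frames and steps is not stated in the visible part of the abstract, so it is left as a plain argument here.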
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
