Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

Bolin Lai; Sangmin Lee; Xu Cao; Xiang Li; James M. Rehg

arXiv:2505.20629 · cs.CV · March 17, 2026

TL;DR

This paper introduces FlexTI2V, a training-free method for text-image-to-video generation. It incorporates visual conditions into text-to-video foundation models through a random patch swapping strategy with dynamic control, and it outperforms previous training-free image conditioning methods.

Contribution

The paper presents a unified, training-free approach to flexible visual conditioning in text-to-video models, enabling conditioning on an arbitrary number of images at arbitrary positions without finetuning.

Findings

01. Outperforms previous training-free image conditioning methods.

02. Generalizes to UNet-based and transformer-based architectures.

03. Effectively balances creativity and fidelity through dynamic control.

Abstract

Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through…
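
To make the patch swapping idea concrete, the following is a minimal sketch of one way a random subset of patches could be copied from a noisy condition-image latent into a video-frame latent at the same noise level. It is an illustration under stated assumptions, not the authors' implementation: the function name `random_patch_swap`, the parameters `patch` and `swap_ratio`, the non-overlapping grid, and the use of plain 2D lists in place of latent tensors are all hypothetical. The abstract does not specify the patch size, the swap proportion, or how the swap varies across frames and denoising steps.

```python
import random


def random_patch_swap(frame, cond, patch=2, swap_ratio=0.5, rng=None):
    """Copy a random subset of patches from `cond` into a copy of `frame`.

    frame, cond: 2D lists (H x W) of floats standing in for latents at the
    same noise level. Hypothetical helper for illustration only.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")

    # Top-left corner of every non-overlapping patch in the grid.
    corners = [(r, c)
               for r in range(0, height, patch)
               for c in range(0, width, patch)]

    # Assumed knob: the fraction of patches taken from the condition latent.
    n_swap = round(swap_ratio * len(corners))
    chosen = rng.sample(corners, n_swap)

    out = [row[:] for row in frame]  # copy so the input frame is not modified
    for r, c in chosen:
        for dr in range(patch):
            out[r + dr][c:c + patch] = cond[r + dr][c:c + patch]
    return out, sorted(chosen)


# Toy stand-ins: zeros for the video-frame latent, ones for the condition latent.
frame = [[0.0] * 4 for _ in range(4)]
cond = [[1.0] * 4 for _ in range(4)]
mixed, swapped = random_patch_swap(frame, cond, patch=2, swap_ratio=0.5,
                                   rng=random.Random(0))
```

In this sketch, a `swap_ratio` of 1.0 reproduces the condition latent exactly and 0.0 leaves the frame untouched, so intermediate values trade fidelity to the condition image against freedom for the T2V model. Whether the paper's dynamic control works this way is not stated in the text above.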

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis