Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

Tongtong Su; Chengyu Wang; Bingyan Liu; Jun Huang; Dongming Lu

arXiv:2507.13753 · cs.CV · July 21, 2025

TL;DR

This paper presents EVS, a training-free method that combines text-to-image and text-to-video models to generate high-quality, smooth videos from text descriptions, improving visual fidelity and motion consistency.

Contribution

EVS introduces a novel composition of T2I and T2V models that enhances video quality without additional training, addressing flickering and artifacts in text-to-video synthesis.

Findings

01. Significant improvement in video quality and motion smoothness.
02. Inference speed increased by 1.6x to 4.5x.
03. Validated effectiveness through experimental comparisons.

Abstract

In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ…
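The frame-refinement idea in the abstract (push a low-quality frame part-way into the noise schedule, then denoise it back) can be illustrated with a toy sketch. This is not the EVS implementation: the excerpt above does not specify the noise schedule, step count, or sampler, so the linear beta schedule, the deterministic DDIM-style update, and the names `make_alpha_bar`, `add_noise`, `refine_frame`, `predict_noise`, and `strength` are all assumptions chosen for illustration. The `oracle` function stands in for a pretrained T2I denoiser and simply knows the clean frame, which is only possible in a toy setting.

```python
import numpy as np

def make_alpha_bar(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(frame, t, alpha_bar, rng):
    """Forward diffusion: jump from the input frame directly to noise level t."""
    eps = rng.standard_normal(frame.shape)
    return np.sqrt(alpha_bar[t]) * frame + np.sqrt(1.0 - alpha_bar[t]) * eps

def refine_frame(frame, predict_noise, alpha_bar, strength, rng):
    """Noise a frame part-way, then denoise it with deterministic DDIM-style steps.

    predict_noise(x_t, t) is a hypothetical stand-in for a pretrained denoiser.
    strength in (0, 1] sets how far into the schedule the frame is pushed.
    """
    t_start = max(int(strength * len(alpha_bar)) - 1, 0)
    x = add_noise(frame, t_start, alpha_bar, rng)
    for t in range(t_start, -1, -1):
        eps_hat = predict_noise(x, t)
        # Clean-frame estimate implied by the predicted noise.
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        if t == 0:
            x = x0_hat
        else:
            # Move to the previous noise level without injecting fresh noise.
            x = (np.sqrt(alpha_bar[t - 1]) * x0_hat
                 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat)
    return x

# Toy usage: a "low-quality" frame is a clean frame plus an artifact pattern.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
clean = np.full((4, 4), 0.5)
low_quality = clean + 0.3 * rng.standard_normal(clean.shape)

def oracle(x_t, t):
    """Toy denoiser that knows the clean frame (a real model would not)."""
    return (x_t - np.sqrt(alpha_bar[t]) * clean) / np.sqrt(1.0 - alpha_bar[t])

refined = refine_frame(low_quality, oracle, alpha_bar, strength=0.5, rng=rng)
```

In this sketch, a larger `strength` discards more of the input frame (including its artifacts) before denoising, while a smaller one preserves more of the original content; how EVS balances this trade-off is not stated in the excerpt.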

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Face Recognition and Analysis