T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Jiachen Li; Weixi Feng; Tsu-Jui Fu; Xinyi Wang; Sugato Basu; Wenhu Chen; William Yang Wang

arXiv:2405.18750·cs.CV·October 14, 2024

TL;DR

T2V-Turbo integrates feedback from a mixture of differentiable reward models into the consistency distillation of a pre-trained text-to-video model, aiming for video generation that is both fast and high in quality.

Contribution

The paper combines reward feedback with consistency distillation to improve video quality without sacrificing inference speed.

Findings

01

Achieves the highest total score on VBench with 4-step generations.

02

Outperforms Gen-2 and Pika in quality metrics.

03

Human evaluators prefer T2V-Turbo's 4-step generations over 50-step DDIM samples from its teacher models.

Abstract

Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the…
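
The objective described in the abstract can be illustrated with a toy, framework-free sketch. It only shows how a single-step estimate produced for the CD loss might be reused as the input to several weighted reward terms. Everything named here is an assumption made for illustration: the function names, the mean-squared distance, the weighted-sum form of the reward mixture, and the use of plain float lists in place of video tensors are not taken from the paper or its repository, and a real implementation would rely on an autograd framework.

```python
from typing import Callable, Sequence

Sample = Sequence[float]            # stand-in for a video latent (assumption)
RewardFn = Callable[[Sample], float]


def mean_squared_distance(a: Sample, b: Sample) -> float:
    """Toy distance between the student and target single-step estimates."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixed_reward_cd_loss(
    student_x0: Sample,
    target_x0: Sample,
    reward_fns: Sequence[RewardFn],
    reward_weights: Sequence[float],
) -> float:
    """Hypothetical combined objective: CD distance minus weighted rewards.

    student_x0 is the single-step generation that the CD loss already
    needs, so scoring it with reward models requires no extra sampling
    loop to backpropagate through.
    """
    if len(reward_fns) != len(reward_weights):
        raise ValueError("need exactly one weight per reward function")
    cd_loss = mean_squared_distance(student_x0, target_x0)
    # Rewards are maximized, so they enter the minimized loss with a minus sign.
    reward = sum(w * fn(student_x0) for fn, w in zip(reward_fns, reward_weights))
    return cd_loss - reward
```

In this sketch, raising a reward weight shifts the balance toward the reward models, and setting all weights to zero leaves only the plain consistency distance.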

Code & Models

Repositories

Ji4chenLi/t2v-turbo (PyTorch, official)

Taxonomy

Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Multimedia Communication and Technology

Methods: Consistency Models · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings