Text-based Talking Video Editing with Cascaded Conditional Diffusion

Bo Han; Heqing Zou; Haoyang Li; Guangcong Wang; Chng Eng Siong

arXiv:2407.14841·cs.CV·July 23, 2024

TL;DR

This paper introduces a cascaded conditional diffusion framework for text-based talking-head video editing. It targets seamless audio-visual transitions, identity preservation, and generalizable talking-face representations, without requiring minutes of per-speaker training video or expensive test-time optimization.

Contribution

The paper proposes a two-stage cascaded diffusion approach for talking-head video editing: dense-landmark motion synthesis from audio, followed by warping-guided frame generation from that motion.

Findings

01. Outperforms previous methods in seamlessness and identity preservation.

02. Requires less training data and no test-time optimization.

03. Achieves high-quality, coherent video editing results.

Abstract

Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. The task is challenging because it requires 1) a generalizable talking-face representation, 2) seamless audio-visual transitions, and 3) identity-preserved talking faces. Previous works either require minutes of talking-face video as training data plus expensive test-time optimization for customized talking video editing, or directly generate a video sequence without considering in-context information, leading to poorly generalizable representations, incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. In the first stage, we first…

Taxonomy

Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Music and Audio Processing

Methods: Diffusion