Text-based Talking Video Editing with Cascaded Conditional Diffusion
Bo Han, Heqing Zou, Haoyang Li, Guangcong Wang, Chng Eng Siong

TL;DR
This paper introduces a cascaded diffusion framework for text-based talking-head video editing that targets seamless transitions, identity preservation, and generalizable face representations, with less training data and no expensive test-time optimization.
Contribution
It proposes a novel two-stage cascaded diffusion approach combining dense-landmark motion synthesis and warping-guided frame generation for improved talking-head video editing.
Findings
Outperforms previous methods in seamlessness and identity preservation.
Requires less training data and no test-time optimization.
Achieves high-quality, coherent video editing results.
Abstract
Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because of 1) generalizable talking-face representation, 2) seamless audio-visual transitions, and 3) identity-preserved talking faces. Previous works either require minutes of talking-face video training data and expensive test-time optimization for customized talking video editing or directly generate a video sequence without considering in-context information, leading to a poor generalizable representation, or incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. In the first stage, we first…
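To make the task definition concrete, the three edit types named in the abstract (insert, delete, substitute) can be read off a word-level diff between the original and the edited transcript. The sketch below is only an illustration of that idea, not the paper's method: the function name `text_edit_ops` is hypothetical, the diff uses Python's standard `difflib`, and it omits the word-to-audio/frame time alignment a real editing system would need.

```python
from difflib import SequenceMatcher


def text_edit_ops(original: str, edited: str):
    """Return word-level edit operations that turn `original` into `edited`.

    Hypothetical helper for illustration only; it works on whitespace-split
    words and knows nothing about audio or video timing.
    """
    src, dst = original.split(), edited.split()
    # Map difflib's opcode names onto the edit types used in the abstract.
    names = {"replace": "substitute", "delete": "delete", "insert": "insert"}
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words need no video edit
        ops.append((names[tag], src[i1:i2], dst[j1:j2]))
    return ops


if __name__ == "__main__":
    for op in text_edit_ops("i like green tea a lot", "i like black tea"):
        print(op)
```

Each returned tuple holds the edit type, the removed words, and the inserted words; in a full pipeline these spans would presumably be mapped to time ranges in the source video before any motion or frame synthesis.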
Taxonomy
Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Music and Audio Processing
Methods: Diffusion
