Context-Aware Talking-Head Video Editing
Songlin Yang, Wei Wang, Jun Ling, Bo Peng, Xu Tan, Jing Dong

TL;DR
This paper introduces a novel, efficient framework for talking-head video editing that ensures accurate lip synchronization, smooth motion, and disentangled control of verbal and non-verbal cues, outperforming prior methods.
Contribution
The work presents a new framework combining motion prediction and neural rendering for efficient, high-quality talking-head video editing with disentangled control.
Findings
Achieves smoother, more realistic video edits with higher lip-sync accuracy.
Requires less data and training time than previous methods.
Provides better generalization to unseen speech and identities.
Abstract
Talking-head video editing aims to efficiently insert, delete, and substitute words in a pre-recorded video through a text transcript editor. The key challenge for this task is obtaining an editing model that generates new talking-head video clips with both accurate lip synchronization and smooth motion. Previous approaches, including 3DMM-based (3D Morphable Model) methods and NeRF-based (Neural Radiance Field) methods, are sub-optimal in that they either require minutes of source video and days of training time or lack disentangled control of verbal (e.g., lip motion) and non-verbal (e.g., head pose and expression) representations for video clip insertion. In this work, we fully utilize the video context to design a novel framework for talking-head video editing, which achieves efficiency, disentangled motion control, and sequential smoothness.…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Video Analysis and Summarization
