FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling
Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh

TL;DR
This paper introduces FacEDiT, a unified diffusion-based model that treats talking face editing and generation as a single task of speech-conditional facial motion infilling, enabling seamless, accurate, and speech-aligned face synthesis and editing.
Contribution
The paper proposes a novel unified framework for talking face editing and generation using facial motion infilling with a diffusion transformer, along with a new benchmark dataset and evaluation metrics.
Findings
FacEDiT achieves accurate speech-aligned facial edits.
The model ensures seamless transitions and identity preservation.
It generalizes well to talking face generation.
Abstract
Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip…
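The abstract describes the training idea only at a high level, so the following is a minimal sketch of masked motion infilling with a flow matching target, not the authors' implementation. It assumes facial motion is a list of per-frame feature vectors, a single contiguous masked span, a straight-line path from Gaussian noise to data, and a loss computed on masked frames only. All function names are hypothetical, and speech conditioning and the Diffusion Transformer itself are omitted.

```python
import random


def make_span_mask(num_frames, start, length):
    """Per-frame flags: True marks frames the model must infill."""
    return [start <= i < start + length for i in range(num_frames)]


def flow_matching_pair(motion, mask, t, rng):
    """Build one training example for masked motion infilling.

    motion: list of frames, each a list of floats (assumed representation).
    mask:   list of bools, True where a frame is hidden from the model.
    t:      flow matching time in [0, 1]; 0 is pure noise, 1 is data.
    """
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in motion]
    # Point on the straight path between noise and the clean motion.
    x_t = [[(1.0 - t) * n + t * m for n, m in zip(nf, mf)]
           for nf, mf in zip(noise, motion)]
    # Velocity of that path, which the network would be trained to predict.
    target = [[m - n for n, m in zip(nf, mf)]
              for nf, mf in zip(noise, motion)]
    # Visible context: clean motion on unmasked frames, zeros on masked ones.
    context = [[0.0] * len(mf) if hidden else list(mf)
               for mf, hidden in zip(motion, mask)]
    return x_t, context, target


def masked_loss(pred, target, mask):
    """Mean squared error restricted to masked frames."""
    errs = [(p - q) ** 2
            for pf, qf, hidden in zip(pred, target, mask) if hidden
            for p, q in zip(pf, qf)]
    return sum(errs) / len(errs) if errs else 0.0
```

Under this framing, an edit such as substitution would mask only the frames to be changed, while generation from scratch would mask every frame, which is one way to read the claim that editing and generation share a single formulation.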
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Evolutionary Psychology and Human Behavior
