EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

John Flynn; Wolfgang Paier; Dimitar Dinev; Sam Nhut Nguyen; Hayk Poghosyan; Manuel Toribio; Sandipan Banerjee; Guy Gafni

arXiv:2601.22127 · cs.CV · January 30, 2026

Open Access

TL;DR

EditYourself is a diffusion transformer-based framework that enables precise, audio-driven editing of talking head videos, allowing seamless content modification while preserving motion, identity, and lip sync.

Contribution

EditYourself introduces an audio-conditioned video editing method built on diffusion transformers, enabling transcript-based modification of talking head videos with high fidelity and temporal coherence.

Findings

01. Achieves accurate lip synchronization in edited videos.

02. Maintains visual identity and motion coherence over long durations.

03. Enables realistic addition, removal, and retiming of speech segments.

Abstract

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human…
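As a rough illustration of what transcript-based addition, removal, and retiming imply at the frame level, the Python sketch below turns a list of timed edits into a per-frame plan: source frames to keep, plus placeholder slots that a generative model would have to fill. This is not the authors' method. The names `Edit` and `plan_frames`, the fixed frame rate, and the choice to mark whole frames (rather than face or mouth regions) for synthesis are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """Hypothetical edit record: replace source span [start_s, end_s) with new audio."""
    start_s: float         # start of the replaced source span, in seconds
    end_s: float           # end of the replaced source span, in seconds
    new_duration_s: float  # length of the replacement audio (0 means pure removal)


def _to_frame(seconds: float, fps: float) -> int:
    # Round half up so that boundaries behave predictably.
    return int(seconds * fps + 0.5)


def plan_frames(num_frames: int, fps: float, edits: List[Edit]) -> List[Optional[int]]:
    """Return one entry per output frame.

    An integer is the index of a source frame to keep; None marks a frame
    that would need to be synthesized (the temporal part of an inpainting mask).
    """
    plan: List[Optional[int]] = []
    cursor = 0  # next source frame that has not been consumed yet
    for edit in sorted(edits, key=lambda e: e.start_s):
        start = min(num_frames, _to_frame(edit.start_s, fps))
        end = min(num_frames, _to_frame(edit.end_s, fps))
        if start < cursor or end < start:
            raise ValueError("edits must be non-overlapping with start_s <= end_s")
        plan.extend(range(cursor, start))                               # untouched source frames
        plan.extend([None] * _to_frame(edit.new_duration_s, fps))      # frames to synthesize
        cursor = end                                                    # skip the replaced span
    plan.extend(range(cursor, num_frames))                              # remaining source frames
    return plan


# Example: a 5-second clip at 10 fps where a 1-second phrase is replaced by a 0.5-second one.
plan = plan_frames(num_frames=50, fps=10.0, edits=[Edit(1.0, 2.0, 0.5)])
mask = [frame is None for frame in plan]  # True where new content is needed
```

A pure insertion uses `start_s == end_s`, and a pure removal uses `new_duration_s == 0`. In a real system the synthesized slots would be conditioned on the new audio and on neighboring frames; that part is outside the scope of this sketch.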

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Music Technology and Sound Studies