GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
Anushka Agarwal, Muhammad Yusuf Hassan, Talha Chafekar

TL;DR
GenSync is a unified framework that synthesizes lip-synced videos for multiple speakers using 3D Gaussian Splatting. A disentanglement module separates identity-specific features from audio representations, giving 6.8x faster training than state-of-the-art models while maintaining lip-sync accuracy and visual quality.
Contribution
The paper introduces a multi-identity lip-sync framework with a disentanglement module, which reduces training time while maintaining high visual and lip-sync quality.
Findings
Achieves 6.8x faster training than state-of-the-art models.
Maintains high lip-sync accuracy and visual quality across multiple identities.
Uses 3D Gaussian Splatting for efficient multi-subject video synthesis.
Abstract
We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
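The abstract does not spell out how the Disentanglement Module is built, so the following is only a toy sketch of the general idea it describes: one shared network serves every speaker, and identity enters through a separate per-speaker embedding rather than through a separately trained model. The class name `UnifiedConditioner`, the dimensions, the random weights, and the reading of the output as a deformation vector are all assumptions made for illustration, not the authors' architecture.

```python
import random


class UnifiedConditioner:
    """Toy stand-in: one shared linear map conditioned on (identity, audio).

    Hypothetical illustration only; this is not GenSync's actual module.
    """

    def __init__(self, num_identities, id_dim, audio_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.audio_dim = audio_dim
        # One small embedding per speaker keeps identity information
        # separate from the audio features (assumed design).
        self.identity_table = [
            [rng.uniform(-1, 1) for _ in range(id_dim)]
            for _ in range(num_identities)
        ]
        # A single weight matrix shared by every speaker.
        in_dim = id_dim + audio_dim
        self.weights = [
            [rng.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def forward(self, identity_id, audio_features):
        if len(audio_features) != self.audio_dim:
            raise ValueError("unexpected audio feature length")
        # The two streams stay separate until this concatenation.
        x = self.identity_table[identity_id] + list(audio_features)
        # Shared linear layer: one output value per weight row.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


model = UnifiedConditioner(num_identities=3, id_dim=4, audio_dim=8, out_dim=6)
audio = [0.1 * i for i in range(8)]   # placeholder audio features
offsets_a = model.forward(0, audio)   # speaker 0
offsets_b = model.forward(1, audio)   # speaker 1, same audio
```

Feeding the same audio features with two different identity indices yields two different outputs from the same shared weights. Under these assumptions, adding a speaker means adding one embedding row rather than training a new model.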
Taxonomy
Topics: Speech and Audio Processing · Music Technology and Sound Studies · Music and Audio Processing
