Data standardization for robust lip sync
Chun Wang

TL;DR
This paper introduces a data standardization pipeline for lip sync that disentangles and standardizes visual input to improve robustness and data efficiency of lip sync methods, especially in challenging real-world scenarios.
Contribution
It proposes a novel data standardization approach based on 3D face reconstruction to disentangle lip motion from distracting factors, enhancing lip sync robustness.
Findings
Improved robustness of lip sync methods in the wild.
Enhanced data efficiency for existing lip sync models.
Achieved competitive performance in active speaker detection.
Abstract
Lip sync is a fundamental audio-visual task. However, existing lip sync methods fall short of being robust in the wild. One important cause could be distracting factors on the visual input side, which make lip motion information difficult to extract. To address this issue, this paper proposes a data standardization pipeline to standardize the visual input for lip sync. Based on recent advances in 3D face reconstruction, we first create a model that can consistently disentangle lip motion information from the raw images. Then, standardized images are synthesized with disentangled lip motion information, with all other attributes related to distracting factors set to predefined values independent of the input, to reduce their effects. Using synthesized images, existing lip sync methods improve their data efficiency and robustness, and they achieve competitive performance for the active…
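The standardization idea in the abstract (keep the lip motion information, reset everything else to predefined values) can be illustrated with a minimal sketch. It assumes a 3DMM-style parameterization in which each frame is described by separate coefficient groups and lip motion is carried by the expression group. The `FaceParams` class, the group names and sizes, the `CANONICAL` values, and the `standardize` function are all hypothetical and chosen for illustration; they are not the paper's actual model. The 3D reconstruction and image synthesis steps are omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical 3DMM-style coefficient groups for one video frame."""
    identity: Tuple[float, ...]    # speaker-specific face shape
    expression: Tuple[float, ...]  # assumed here to carry the lip motion
    pose: Tuple[float, ...]        # head rotation and translation
    lighting: Tuple[float, ...]    # illumination coefficients
    texture: Tuple[float, ...]     # skin appearance


# Predefined values that do not depend on the input (illustrative numbers).
CANONICAL = FaceParams(
    identity=(0.0,) * 4,
    expression=(0.0,) * 3,
    pose=(0.0,) * 6,
    lighting=(1.0, 0.0, 0.0),
    texture=(0.0,) * 4,
)


def standardize(params: FaceParams) -> FaceParams:
    """Keep the expression coefficients; set every other group to CANONICAL."""
    return replace(CANONICAL, expression=params.expression)


if __name__ == "__main__":
    # Two frames with the same lip motion but different speaker, pose, and lighting.
    frame_a = FaceParams(
        identity=(0.3, -0.1, 0.2, 0.0),
        expression=(0.5, 0.1, -0.2),
        pose=(0.1, 0.4, 0.0, 2.0, -1.0, 0.5),
        lighting=(0.8, 0.1, 0.1),
        texture=(0.2, 0.2, -0.3, 0.1),
    )
    frame_b = FaceParams(
        identity=(-0.4, 0.2, 0.0, 0.1),
        expression=(0.5, 0.1, -0.2),
        pose=(-0.3, 0.0, 0.2, 0.0, 0.0, 0.0),
        lighting=(0.3, 0.5, 0.2),
        texture=(-0.1, 0.0, 0.4, 0.0),
    )
    # After standardization the two frames are indistinguishable.
    print(standardize(frame_a) == standardize(frame_b))
```

In this toy setting, two frames that differ in speaker, head pose, and lighting but share the same expression coefficients map to the same standardized parameters, which is the invariance a downstream lip sync model would benefit from.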
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
