PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Ku, Do Hyun Lee, and Hong Kook Kim

TL;DR
This paper introduces PS-TTS and PS-Comet TTS, two methods that improve lip-sync and semantic accuracy in automated multilingual dubbing through phonetic synchronization and paraphrasing.
Contribution
It presents synchronization techniques that combine paraphrasing, dynamic time warping, and semantic considerations to improve the naturalness and accuracy of AI-based dubbing systems.
Findings
Both systems outperform baseline TTS in objective metrics.
PS-Comet achieves better lip-sync and semantic preservation across languages.
Experiments confirm cross-linguistic applicability of the methods.
Abstract
Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes…
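To make the DTW step in the abstract concrete, here is a minimal sketch of how a vowel-based alignment cost could be used to choose among paraphrase candidates. It is an illustration under stated assumptions, not the authors' implementation: the distance table `_DIST` holds invented placeholder values (the paper measures vowel distances from training data), the reduction of each text to a plain vowel sequence is assumed, and the names `vowel_distance`, `dtw_cost`, and `pick_candidate` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Placeholder vowel distances, keyed by alphabetically sorted pairs.
# ASSUMPTION: these numbers are invented for illustration only; the paper
# derives its local costs from training data.
_DIST: Dict[Tuple[str, str], float] = {
    ("a", "e"): 0.4, ("a", "i"): 0.7, ("a", "o"): 0.5, ("a", "u"): 0.9,
    ("e", "i"): 0.3, ("e", "o"): 0.6, ("e", "u"): 0.8,
    ("i", "o"): 0.8, ("i", "u"): 0.6, ("o", "u"): 0.3,
}


def vowel_distance(x: str, y: str) -> float:
    """Symmetric local cost between two vowels (1.0 for unknown pairs)."""
    if x == y:
        return 0.0
    a, b = sorted((x, y))
    return _DIST.get((a, b), 1.0)


def dtw_cost(source: Sequence[str], target: Sequence[str]) -> float:
    """Accumulated cost of the best DTW alignment of two vowel sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    # acc[i][j] is the best cost of aligning source[:i] with target[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = vowel_distance(source[i - 1], target[j - 1])
            acc[i][j] = local + min(
                acc[i - 1][j],      # source vowel repeats against the same target vowel
                acc[i][j - 1],      # target vowel repeats against the same source vowel
                acc[i - 1][j - 1],  # one-to-one match
            )
    return acc[n][m]


def pick_candidate(source: Sequence[str],
                   candidates: List[Sequence[str]]) -> Sequence[str]:
    """Return the candidate vowel sequence with the lowest DTW cost."""
    return min(candidates, key=lambda cand: dtw_cost(source, cand))
```

For example, `pick_candidate(["a", "o"], [["i", "e"], ["a", "u"]])` prefers `["a", "u"]`, because its vowels are closer to the source under the placeholder table. A real system would also need the semantic and duration constraints described in the abstract, which this sketch leaves out.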
