High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
Junchen Lu, Berrak Sisman, Mingyang Zhang, Haizhou Li

TL;DR
This paper introduces a new automatic voice-over method that uses self-supervised discrete speech units to improve lip-speech synchronization and speech quality, outperforming existing approaches.
Contribution
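The paper proposes an Automatic Voice Over (AVO) framework that uses self-supervised discrete speech unit prediction as its learning objective. This supervises text-video alignment more directly than acoustic feature reconstruction does, and it improves both synchronization and speech quality.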
Findings
Achieves superior lip-speech synchronization
Improves speech quality over baselines
Outperforms existing methods in evaluations
Abstract
The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance and synthetic speech quality. To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality by outperforming baselines in both objective and…
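To make the contrast between the two learning objectives concrete, here is a minimal sketch of what a discrete-unit prediction loss can look like. It is not the authors' implementation. It assumes the units are nearest-centroid (k-means style) indices over self-supervised speech features, which is a common way of obtaining such units, and that the model outputs one vector of logits per frame. The function names, shapes, and codebook are hypothetical.

```python
import numpy as np


def quantize_to_units(features, codebook):
    """Map each frame to the index of its nearest codebook centroid.

    Assumption: units come from k-means style clustering of
    self-supervised features; the abstract does not specify this.
    features: (T, D) array, codebook: (K, D) array -> (T,) int array.
    """
    features = np.asarray(features, dtype=np.float64)
    codebook = np.asarray(codebook, dtype=np.float64)
    # Squared Euclidean distance from every frame to every centroid.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def unit_prediction_loss(logits, unit_ids):
    """Mean frame-level cross-entropy over discrete unit targets.

    logits: (T, K) unnormalized scores, unit_ids: (T,) target indices.
    """
    logits = np.asarray(logits, dtype=np.float64)
    unit_ids = np.asarray(unit_ids, dtype=np.int64)
    # Numerically stable log-softmax over the K unit classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(unit_ids)), unit_ids]
    return float(-picked.mean())


def reconstruction_loss(pred_mel, target_mel):
    """Mean absolute error on acoustic features, shown for contrast."""
    pred_mel = np.asarray(pred_mel, dtype=np.float64)
    target_mel = np.asarray(target_mel, dtype=np.float64)
    return float(np.abs(pred_mel - target_mel).mean())
```

In this sketch the unit loss is a per-frame classification: each output frame is either assigned the right unit or not. That is one way to read the abstract's claim that unit prediction supervises alignment more directly than regressing continuous acoustic features.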
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Subtitles and Audiovisual Media
