High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

Junchen Lu; Berrak Sisman; Mingyang Zhang; Haizhou Li

arXiv:2306.17005·eess.AS·June 30, 2023

Open Access

TL;DR

This paper introduces a new automatic voice-over method that uses self-supervised discrete speech units to improve lip-speech synchronization and speech quality, outperforming existing approaches.

Contribution

The paper proposes an Automatic Voice Over (AVO) framework that uses self-supervised discrete speech units as direct supervision for alignment learning, improving both lip-speech synchronization and speech quality.

Findings

01. Achieves superior lip-speech synchronization
02. Improves speech quality over baselines
03. Outperforms existing methods in evaluations

Abstract

The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance and synthetic speech quality. To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality by outperforming baselines in both objective and…
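The idea of predicting discrete speech units instead of reconstructing acoustic features can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the units are obtained or which model predicts them. The sketch assumes, purely for illustration, that units are indices of the nearest entry in a small codebook of feature centroids and that the model emits per-frame logits over those units. The names `quantize`, `unit_cross_entropy`, `centroids`, and `frames` are hypothetical.

```python
import math

def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: discrete units are nearest-centroid indices; the abstract
    does not state how the paper derives its units.
    """
    units = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units

def unit_cross_entropy(logits, units):
    """Mean cross-entropy between per-frame logits and target unit indices."""
    total = 0.0
    for frame_logits, unit in zip(logits, units):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(frame_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_z - frame_logits[unit]
    return total / len(units)

# Toy data (hypothetical): 2-D features and a 3-entry codebook.
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8], [0.8, 0.9]]
units = quantize(frames, centroids)

# An uninformed predictor (uniform logits) versus one that favours the targets.
uniform_logits = [[0.0, 0.0, 0.0] for _ in units]
sharp_logits = [[5.0 if k == u else 0.0 for k in range(3)] for u in units]

uniform_loss = unit_cross_entropy(uniform_logits, units)
sharp_loss = unit_cross_entropy(sharp_logits, units)
```

In this sketch each frame has exactly one target unit, so the loss rewards putting the right unit at the right time step. That is one way to read the abstract's claim that unit prediction gives more direct alignment supervision than a regression loss on acoustic features.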

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Subtitles and Audiovisual Media