Large-scale unsupervised audio pre-training for video-to-speech synthesis
Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic

TL;DR
This paper introduces a large-scale unsupervised audio pre-training approach for video-to-speech synthesis, leveraging extensive audio-only datasets to improve speech reconstruction quality in a cross-modal setting.
Contribution
It proposes pre-training audio decoders on large audio-only datasets to enhance video-to-speech synthesis, enabling the use of abundant audio data without requiring visual annotations.
Findings
Pre-training improves speech reconstruction quality.
Unsupervised audio pre-training benefits cross-modal synthesis.
The method outperforms existing approaches on benchmarks.
Abstract
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a spectrogram, is extracted first and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training on data from almost exclusively audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets etc.), as well as audio-only architectures that have been developed by the audio machine learning community…
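As a rough illustration of the two-step process described in the abstract (video to spectrogram, then spectrogram to raw audio via a vocoder), the sketch below only tracks the tensor sizes that flow between the two steps. All numeric values (25 fps video, 16 kHz audio, a hop of 160 samples, 80 mel bins) and the names ClipSpec, spectrogram_shape, and waveform_length are hypothetical choices made for this example. They are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    # All defaults are illustrative assumptions, not values from the paper.
    video_fps: int = 25        # assumed video frame rate
    sample_rate: int = 16_000  # assumed audio sample rate in Hz
    hop_length: int = 160      # assumed spectrogram hop size in samples
    n_mels: int = 80           # assumed number of mel bins


def spectrogram_shape(num_video_frames: int, spec: ClipSpec) -> tuple[int, int]:
    """Step 1 output size: (mel frames, mel bins) predicted from a silent clip."""
    samples_per_video_frame = spec.sample_rate // spec.video_fps
    mel_frames_per_video_frame = samples_per_video_frame // spec.hop_length
    return (num_video_frames * mel_frames_per_video_frame, spec.n_mels)


def waveform_length(mel_frames: int, spec: ClipSpec) -> int:
    """Step 2 output size: number of raw audio samples the vocoder must produce."""
    return mel_frames * spec.hop_length


spec = ClipSpec()
mel_shape = spectrogram_shape(75, spec)            # 75 frames = 3 s at 25 fps
num_samples = waveform_length(mel_shape[0], spec)  # samples for the same 3 s
```

Under these assumed settings, each video frame corresponds to four spectrogram frames, so the second step (the vocoder) operates purely on audio-domain quantities. That is the part of the pipeline that audio-only data could, in principle, be used to pre-train.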
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
