Large-scale unsupervised audio pre-training for video-to-speech synthesis

Triantafyllos Kefalas; Yannis Panagakis; Maja Pantic

arXiv:2306.15464 · cs.SD · October 28, 2024

Open Access

TL;DR

This paper introduces a large-scale unsupervised audio pre-training approach for video-to-speech synthesis, leveraging extensive audio-only datasets to improve speech reconstruction quality in a cross-modal setting.

Contribution

The paper proposes pre-training audio decoders on large audio-only datasets and then using them for video-to-speech synthesis, so that abundant audio data with no corresponding video can contribute to training.
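The two-stage idea can be illustrated with a deliberately tiny, pure-Python sketch. Everything in it is an assumption made for illustration: the linear encoder and decoder, the 2-D "audio frames", the scalar "video feature", and all function names are hypothetical and are not the paper's architecture or training setup. Stage 1 fits an audio autoencoder on audio-only data; stage 2 keeps the pre-trained decoder frozen and learns only a video encoder on a small paired set.

```python
# Toy sketch only: hypothetical names, linear models, synthetic data.

def decode(d, z):
    """Linear audio decoder: scalar latent z -> 2-D 'audio frame'."""
    return [d[0] * z, d[1] * z]

def sq_err(pred, target):
    """Squared reconstruction error for one frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def pretrain_audio_autoencoder(frames, steps=2000, lr=0.05):
    """Stage 1: fit encoder e and decoder d on audio-only frames."""
    e, d = [0.1, 0.1], [0.1, 0.1]
    n = len(frames)
    for _ in range(steps):
        ge, gd = [0.0, 0.0], [0.0, 0.0]
        for x in frames:
            z = e[0] * x[0] + e[1] * x[1]            # encode
            r = [d[0] * z - x[0], d[1] * z - x[1]]   # reconstruction residual
            gz = 2 * (r[0] * d[0] + r[1] * d[1])     # d(loss)/dz
            for k in range(2):
                gd[k] += 2 * r[k] * z / n
                ge[k] += gz * x[k] / n
        e = [e[k] - lr * ge[k] for k in range(2)]
        d = [d[k] - lr * gd[k] for k in range(2)]
    return e, d

def train_video_encoder(pairs, d, steps=200, lr=0.05):
    """Stage 2: learn a scalar video encoder w; decoder d is never updated."""
    w = 0.0
    n = len(pairs)
    for _ in range(steps):
        gw = 0.0
        for v, x in pairs:
            r = [p - t for p, t in zip(decode(d, w * v), x)]
            gw += 2 * (r[0] * d[0] + r[1] * d[1]) * v / n
        w -= lr * gw
    return w

# Synthetic data: many audio-only frames, few (video, audio) pairs.
audio_only = [[s, 2 * s] for s in (i / 10 - 1 for i in range(21))]
paired = [(3 * s, [s, 2 * s]) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)]

_, decoder = pretrain_audio_autoencoder(audio_only)
frozen_copy = list(decoder)
w = train_video_encoder(paired, decoder)

untrained_loss = sum(sq_err(decode(decoder, 0.0), x) for _, x in paired) / len(paired)
final_loss = sum(sq_err(decode(decoder, w * v), x) for v, x in paired) / len(paired)
```

In this sketch the decoder only ever sees audio during stage 1, which is the point of the approach: the paired audio-visual data is needed only to learn the mapping from video into the decoder's input space. Freezing the decoder is a simplification chosen here; fine-tuning it jointly would be an equally plausible variant.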

Findings

01. Pre-training improves speech reconstruction quality.

02. Unsupervised audio pre-training benefits cross-modal synthesis.

03. Method outperforms existing approaches on benchmarks.

Abstract

Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a spectrogram, is extracted first and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training on data from almost exclusively audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets etc.), as well as audio-only architectures that have been developed by the audio machine learning community…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization