Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Yuma Koizumi; Heiga Zen; Shigeki Karita; Yifan Ding; Kohei Yatabe; Nobuyuki Morioka; Yu Zhang; Wei Han; Ankur Bapna; Michiel Bacchiani

arXiv:2303.01664 · cs.SD · August 15, 2023 · 1 citation

Open Access · 1 Repo · 1 Dataset

TL;DR

Miipher is a robust speech restoration model that uses self-supervised speech and text representations to convert degraded speech into high-quality audio, enabling improved training data for speech generation.

Contribution

We introduce Miipher, a speech restoration (SR) model that combines w2v-BERT speech representations with PnG-BERT text representations for robustness, and we apply it to speech data collected from the Web.

Findings

01

Miipher is robust against various audio degradations.

02

Enables training a high-quality TTS model from restored web-collected speech.

03

Demonstrates improved speech restoration performance.

Abstract

Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
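
The abstract says that speech features serve as the input and text features serve as linguistic conditioning, but it does not say how the two are combined. As a rough illustration only, the sketch below assumes one common fusion choice: each speech frame attends over the text tokens and the attended context is added back to the frame. The function name `condition_on_text`, the feature sizes, and the random stand-in features are all hypothetical, and no learned projections are included; this is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def condition_on_text(speech_feats, text_feats):
    """Assumed fusion: scaled dot-product cross-attention plus a residual.

    speech_feats: list of frames, each a list of floats (queries).
    text_feats:   list of tokens, each a list of floats (keys and values).
    Returns one conditioned vector per speech frame.
    """
    dim = len(speech_feats[0])
    scale = 1.0 / math.sqrt(dim)
    conditioned = []
    for frame in speech_feats:
        # Attention weights of this frame over every text token.
        weights = softmax([dot(frame, tok) * scale for tok in text_feats])
        # Weighted sum of text tokens: the linguistic context for this frame.
        context = [
            sum(w * tok[d] for w, tok in zip(weights, text_feats))
            for d in range(dim)
        ]
        # Residual connection keeps the original speech information.
        conditioned.append([f + c for f, c in zip(frame, context)])
    return conditioned


# Random stand-ins for real features (hypothetical shapes: 5 frames, 3 tokens, dim 8).
rng = random.Random(0)
speech_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
text_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

conditioned = condition_on_text(speech_feats, text_feats)
print(len(conditioned), len(conditioned[0]))
```

The output keeps one vector per speech frame, so a downstream restoration network could consume it in place of the unconditioned features. In a real system the stand-ins would be replaced by features from pretrained speech and text encoders.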

Code & Models

Repositories

Wataru-Nakata/miipher
pytorch

Datasets

lucasnewman/libritts-r-webdataset
dataset · 446 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research