EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Tu Anh Nguyen; Wei-Ning Hsu; Antony D'Avirro; Bowen Shi; Itai Gat; Maryam Fazel-Zarani; Tal Remez; Jade Copet; Gabriel Synnaeve; Michael Hassid; Felix Kreuk; Yossi Adi; Emmanuel Dupoux

arXiv:2308.05725 · cs.CL · August 11, 2023

TL;DR

This paper introduces Expresso, a new expressive speech dataset for textless synthesis, and analyzes the performance of self-supervised discrete units in high-quality, expressive speech resynthesis, highlighting challenges and tradeoffs.

Contribution

It provides a novel dataset with spontaneous expressive speech and evaluates discrete unit-based resynthesis, addressing limitations of previous read-speech datasets.

Findings

01 High-quality resynthesis achievable with self-supervised units

02 Tradeoffs between quality, bitrate, and style invariance identified

03 Open source dataset and benchmarks for future research
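As a rough illustration of the bitrate axis in finding 02, the sketch below computes an upper bound on the bitrate of a discrete unit stream. It assumes every unit is equally likely, so each unit costs log2(vocabulary size) bits. The function name `unit_bitrate` and the frame rates and vocabulary sizes are hypothetical and are not taken from the paper.

```python
import math


def unit_bitrate(units_per_second: float, vocab_size: int) -> float:
    """Upper-bound bitrate (bits/s) of a discrete unit stream.

    Assumes a uniform distribution over units, so each unit
    carries log2(vocab_size) bits. Real streams with skewed
    unit usage or deduplicated repeats would fall below this.
    """
    if units_per_second <= 0 or vocab_size < 2:
        raise ValueError("need a positive rate and at least 2 units")
    return units_per_second * math.log2(vocab_size)


# Hypothetical configurations, chosen only for illustration.
for rate_hz, vocab in [(50, 500), (25, 2000)]:
    bps = unit_bitrate(rate_hz, vocab)
    print(f"{rate_hz} units/s, vocab {vocab}: {bps:.0f} bits/s")
```

Under this bound, a larger vocabulary or a higher unit rate raises the bitrate, which is one side of the tradeoff against resynthesis quality.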

Abstract

Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems