EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

TL;DR
This paper introduces Expresso, a new expressive speech dataset for textless synthesis, and analyzes the performance of self-supervised discrete units in high-quality, expressive speech resynthesis, highlighting challenges and tradeoffs.
Contribution
It provides a novel dataset with spontaneous expressive speech and evaluates discrete unit-based resynthesis, addressing limitations of previous read-speech datasets.
Findings
High-quality resynthesis achievable with self-supervised units
Tradeoffs between quality, bitrate, and style invariance identified
Open-source dataset and benchmarks for future research
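The bitrate side of the tradeoff listed above can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes a stream of discrete units emitted at a fixed rate and drawn uniformly from a fixed vocabulary, which gives an upper bound of rate × log2(vocabulary size) bits per second. The function name and the example configurations are hypothetical and chosen only for illustration.

```python
import math


def unit_bitrate(units_per_second: float, vocab_size: int) -> float:
    """Upper bound on bitrate (bits/s) of a discrete unit stream.

    Assumes every unit is equally likely, so each unit carries
    log2(vocab_size) bits. Real unit streams are usually less
    uniform, so their entropy-based bitrate would be lower.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return units_per_second * math.log2(vocab_size)


if __name__ == "__main__":
    # Hypothetical (units per second, vocabulary size) settings,
    # not values reported in the paper.
    for rate, vocab in [(50, 256), (50, 1024), (25, 1024)]:
        print(f"{rate} units/s, {vocab} units: "
              f"{unit_bitrate(rate, vocab):.0f} bits/s")
```

Under these assumptions, a larger vocabulary or a higher unit rate raises the bitrate, which is the quantity being traded against resynthesis quality and style invariance.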
Abstract
Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
