Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Yifan Yang; Feiyu Shen; Chenpeng Du; Ziyang Ma; Kai Yu; Daniel Povey; Xie Chen

arXiv:2309.07377·eess.AS·December 15, 2023

Open Access · 1 Repo

TL;DR

This paper investigates the universality of speech discrete tokens generated by SSL models, demonstrating their effectiveness in speech recognition and synthesis, and highlighting their potential to unify multiple speech tasks with lower storage needs.

Contribution

It provides a comprehensive comparison and optimization of discrete tokens across multiple SSL models, showing their potential as universal representations for speech tasks.

Findings

01. Discrete tokens perform comparably to FBank features in speech recognition.

02. They outperform mel-spectrograms in speech synthesis.

03. Universal discrete tokens show promise across multiple speech tasks.

Abstract

Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks…
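
To make the notion of a discrete token and the storage claim concrete, the following is a minimal sketch, not the authors' pipeline. It assumes tokens are obtained by nearest-centroid lookup against a pre-trained codebook (for example, one learned with k-means), which the abstract above does not state. The function names, the 80-dimensional float32 feature, and the 2000-entry codebook are all hypothetical values chosen for illustration.

```python
import math
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance to every centroid; keep the closest index.
        distances = [sum((x - c) ** 2 for x, c in zip(frame, centroid))
                     for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def bits_per_frame_dense(dim: int, bits_per_value: int = 32) -> int:
    """Storage for one dense feature frame (assumed float32 by default)."""
    return dim * bits_per_value


def bits_per_frame_token(codebook_size: int) -> int:
    """Minimum whole bits needed to store one token index."""
    return math.ceil(math.log2(codebook_size))


# Toy data: 2-dimensional "features" and a 3-entry codebook (both hypothetical).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]
tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]

# Illustrative sizes only: an 80-dim float32 frame vs. one index into 2000 entries.
print(bits_per_frame_dense(80), bits_per_frame_token(2000))  # 2560 11
```

Under these assumed sizes, one token index needs 11 bits where a dense frame needs 2560, which illustrates why a token sequence is much cheaper to store than frame-level features. The actual ratio depends on the feature dimension, frame rate, and codebook size used in a given system.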

Code & Models

Repositories

k2-fsa/icefall
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems