The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

TL;DR
The Interspeech 2024 Challenge introduces benchmarks for speech processing using discrete units across three tasks (multilingual ASR, TTS, and singing voice synthesis), aiming to advance research in this area.
Contribution
This paper presents the design of the Interspeech 2024 Challenge, including tasks, baseline systems, and preliminary results, to promote research on discrete unit-based speech processing.
Findings
Baseline systems are established for each of the three tasks.
Preliminary results indicate the potential of discrete units in speech tasks.
The challenge fosters standardized evaluation in this emerging field.
Abstract
Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in…
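To make the idea of "discrete units" concrete, the following is a minimal sketch, not the challenge's actual pipeline. It assumes units are obtained by assigning each frame-level feature vector to its nearest entry in a codebook (for example, k-means centroids); the summary above does not state how the challenge derives its units. The function names, the toy codebook, and the optional step of collapsing consecutive repeats are all illustrative assumptions.

```python
import numpy as np


def quantize_frames(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame (row of `features`) to the index of its nearest codebook vector.

    features: array of shape (T, D), one D-dimensional vector per frame.
    codebook: array of shape (K, D), e.g. k-means centroids (assumed, not from the source).
    Returns an integer array of shape (T,) holding one unit ID per frame.
    """
    # Squared Euclidean distance between every frame and every centroid: shape (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def collapse_repeats(units: np.ndarray) -> np.ndarray:
    """Drop consecutive duplicate unit IDs (an optional, assumed post-processing step)."""
    if units.size == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]


# Toy data: a hypothetical 3-entry codebook and 5 frames of 2-dimensional features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
features = np.array([[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 4.1], [4.2, 3.9]])

units = quantize_frames(features, codebook)
deduped = collapse_repeats(units)
print(units.tolist())    # [0, 0, 1, 2, 2]
print(deduped.tolist())  # [0, 1, 2]
```

In this sketch each frame is reduced from a real-valued vector to a single integer, which is what lets downstream models treat speech much like a token sequence.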
Taxonomy
Topics: Speech Recognition and Synthesis
