The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

Xuankai Chang; Jiatong Shi; Jinchuan Tian; Yuning Wu; Yuxun Tang; Yihan Wu; Shinji Watanabe; Yossi Adi; Xie Chen; Qin Jin

arXiv:2406.07725 · cs.SD · June 13, 2024

TL;DR

The Interspeech 2024 Challenge introduces benchmarks for speech processing using discrete units across tasks like multilingual ASR, TTS, and singing synthesis, aiming to advance research in this promising area.

Contribution

This paper presents the design of the Interspeech 2024 Challenge, including tasks, baseline systems, and preliminary results, to promote research on discrete unit-based speech processing.

Findings

01. Baseline systems are established for each task.

02. Preliminary results indicate the potential of discrete units in speech tasks.

03. The challenge fosters standardized evaluation in this emerging field.

Abstract

Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in…
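To make the idea of a discrete unit concrete, the following is a minimal sketch that maps continuous feature frames to the index of their nearest codebook vector. Nearest-centroid quantization is assumed here purely for illustration; the abstract does not state how the challenge derives its units. The `quantize` helper, the 2-D toy frames, and the three-entry codebook are all hypothetical.

```python
from math import dist

def quantize(frames, codebook):
    """Return, for each frame, the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]

# Hypothetical toy data: real speech features are high-dimensional,
# and real codebooks are learned rather than written by hand.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (0.2, 1.9)]

units = quantize(frames, codebook)
print(units)  # [0, 1, 1, 2]
```

Each frame is reduced from a vector of floats to a single integer, which is what lets downstream models treat speech as a token sequence.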

Code & Models

Datasets

espnet/DSUChallenge2024
dataset · 91 downloads

Taxonomy

Topics: Speech Recognition and Synthesis