Exploring SSL Discrete Tokens for Multilingual ASR

Mingyu Cui; Daxin Tan; Yifan Yang; Dingdong Wang; Huimeng Wang; Xiao Chen; Xie Chen; Xunying Liu

arXiv:2409.08805 · cs.CL · September 16, 2024

TL;DR

This paper evaluates discrete tokens generated by SSL models for multilingual ASR, showing performance comparable to or better than Fbank features across multiple languages, with notable WER reductions.

Contribution

The paper compares discrete tokens from several leading SSL models in both monolingual and multilingual ASR, addressing a gap left by prior work that studied either multilingual ASR with Fbank features or English ASR with discrete tokens.

Findings

01

Discrete tokens achieve results comparable to systems trained on Fbank features in ASR tasks.

02

Average WER reduction of 0.31% and 1.76% absolute on the dev and test sets, respectively.

03

Notable WER reduction of 6.82% absolute on the Polish test set.
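
The findings above are stated as WER reductions. As a reminder of how such figures are usually derived, here is a minimal sketch that computes WER as word-level edit distance divided by reference length, then reports the absolute and relative reduction between two systems. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper, and the paper's actual scoring tool is not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Hypothetical usage: one substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))
# Hypothetical WERs in percent for a baseline system and a new system.
print(wer_reduction(10.0, 8.0))
```

An absolute reduction is the plain difference between two WER percentages, while a relative reduction divides that difference by the baseline WER, so the same absolute gain looks larger when the baseline is already low.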

Abstract

With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete tokens for multilingual ASR scenarios. This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. We aim to explore the performance and efficiency of speech discrete tokens across multiple language domains for both monolingual and multilingual ASR scenarios. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on Fbank features in ASR tasks across seven language…
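
To make the idea of "discrete tokens" concrete, the sketch below maps continuous feature frames to integer token ids by nearest-centroid lookup in a codebook. This is an assumption for illustration: the excerpt above does not say how the tokens are produced, and codebook lookup (for example with k-means centroids) is only one common way to discretize SSL features. The function name `quantize`, the 2-D frames, and the two-entry codebook are all hypothetical toy values.

```python
def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy 2-D "features"; real SSL features have hundreds of dimensions.
frames = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]]
# Hypothetical codebook with two entries.
centroids = [[0.0, 0.0], [1.0, 1.0]]
print(quantize(frames, centroids))  # [0, 1, 1, 0]
```

Each frame is reduced from a vector of floats to a single integer, which is why token-based inputs can be cheaper to store and process than continuous features such as Fbank.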

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems