TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch

Xingchen Song; Chengdong Liang; Binbin Zhang; Pengshen Zhang; ZiYu Wang; Youcheng Ma; Menglong Xu; Lin Wang; Di Wu; Fuping Pan; Dinghao Zhou; Zhendong Peng

arXiv:2412.15622 · eess.AS · December 23, 2024

TL;DR

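TouchASP introduces an elastic mixture-of-experts (eMoE) speech model that is trained once and then scaled to fit deployment requirements. Trained on millions of hours of audio from diverse domains, it lowers the Character Error Rate (CER) on the SpeechIO test sets from 4.98% to 2.45% and extends beyond recognition to multilingual, multi-dialect, emotion, gender, and sound event perception.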

Contribution

The paper presents the elastic mixture-of-experts (eMoE) model, which is trained once and scaled elastically at deployment, and an unsupervised data creation and validation procedure used to gather training audio at scale.

Findings

01. Reduced CER from 4.98% to 2.45% on the SpeechIO test sets.
02. Achieved multilingual, multi-dialect, emotion, gender, and sound event perception.
03. Enabled elastic deployment adaptable to various resource constraints.

Abstract

Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during training. However, such models can only be deployed on high-compute cloud platforms and can only perform speech recognition. This leads to high costs and restricted capabilities. In this report, we first propose the elastic mixture-of-experts (eMoE) model, which can be trained just once and then elastically scaled according to deployment requirements. Second, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO test sets from 4.98% to 2.45%.…
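To make the headline number concrete, here is a minimal Python sketch of how CER is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. This is the usual textbook definition, not the paper's scoring pipeline; the text normalization applied before scoring is not described on this page. The function names (`edit_distance`, `cer`, `relative_reduction`) are illustrative only. The sketch also works out the relative reduction implied by the reported drop from 4.98% to 2.45%.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Relative CER reduction implied by the reported figures (assumes both
# numbers are measured on the same SpeechIO test sets).
rel = relative_reduction(4.98, 2.45)
print(f"relative CER reduction: {rel:.1%}")
```

Under that definition, the reported improvement corresponds to removing roughly half of the original character errors.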

Taxonomy

Topics: Tactile and Sensory Interactions · Social Robot Interaction and HRI · Speech and dialogue systems