Bilingual End-to-End ASR with Byte-Level Subwords
Liuhui Deng, Roger Hsiao, Arnab Ghoshal

TL;DR
This paper examines how the output representation of an end-to-end neural network affects multilingual speech recognition, and shows that byte-level BPE (BBPE) with penalty schemes improves utterance-based bilingual ASR for English and Mandarin by 2% to 5% relative.
Contribution
It introduces the use of byte-level BPE with penalty schemes for bilingual ASR, showing improved accuracy with fewer outputs and parameters compared to other representations.
Findings
BBPE with penalty schemes improves utterance-based bilingual ASR by 2% to 5% relative
The gain holds even with a smaller number of outputs and fewer parameters
Analysis suggests future directions for multilingual ASR
Abstract
In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative even with a smaller number of outputs and fewer parameters. We conclude with an analysis that indicates directions for further improving multilingual ASR.
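To make the representations named in the abstract concrete, here is a minimal sketch of how the same text looks at the character level and at the UTF-8 byte level, followed by a toy BPE merge loop run over bytes (one simple way to build BBPE units). The function names (`char_tokens`, `byte_tokens`, `learn_bbpe`) and the greedy most-frequent-pair merge rule are assumptions made for illustration, not the authors' implementation. The penalty schemes are not sketched because the text here does not specify them.

```python
from collections import Counter


def char_tokens(text):
    """Character-level units: one symbol per Unicode character."""
    return list(text)


def byte_tokens(text):
    """Byte-level units: UTF-8 bytes, so the base inventory has at most 256 symbols."""
    return list(text.encode("utf-8"))


def most_frequent_pair(seqs):
    """Return the most frequent adjacent symbol pair across all sequences, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bbpe(texts, num_merges):
    """Toy BBPE training (illustrative assumption): greedy BPE merges over UTF-8 bytes.

    Merged units get ids starting at 256, right after the raw byte values.
    Returns the list of (pair, new_id) merges and the re-segmented sequences.
    """
    seqs = [byte_tokens(t) for t in texts]
    merges = []
    next_id = 256
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


if __name__ == "__main__":
    print(char_tokens("你好"))   # two Mandarin characters
    print(byte_tokens("你好"))   # six UTF-8 bytes, three per character
    print(byte_tokens("hi"))     # ASCII text is one byte per character
    merges, seqs = learn_bbpe(["你好你好", "hello hello"], num_merges=5)
    print(merges)
    print(seqs)
```

The trade-off this illustrates: a character-level output layer for Mandarin needs one unit per character, which runs to thousands, whereas a byte-level inventory is capped at 256 base symbols shared by both languages, at the cost of longer sequences. BPE merges over bytes shorten those sequences again while keeping the output size a tunable choice.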
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
