Bilingual End-to-End ASR with Byte-Level Subwords

Liuhui Deng; Roger Hsiao; Arnab Ghoshal

arXiv:2205.00485·cs.CL·May 3, 2022

TL;DR

This paper studies how the output representation of an end-to-end neural network affects multilingual speech recognition, and shows that byte-level BPE (BBPE) with penalty schemes improves utterance-based bilingual ASR for English and Mandarin.

Contribution

The paper applies byte-level BPE (BBPE) with penalty schemes to bilingual ASR and reports better accuracy with fewer outputs and fewer parameters than the other representations studied.

Findings

01 BBPE with penalty schemes improves utterance-based bilingual ASR performance by 2% to 5% relative

02 The improvement holds even with a smaller number of outputs and fewer parameters

03 The concluding analysis indicates directions for further improving multilingual ASR
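The gain is reported as relative, not absolute, and the difference is easy to misread. The sketch below shows the arithmetic only. The error rates in it are hypothetical placeholders chosen for illustration, not results from the paper, and the function name `relative_reduction` is likewise an assumption.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent, NOT taken from the paper.
baseline_error = 10.0
new_error = 9.6

absolute_gain = baseline_error - new_error           # about 0.4 points
relative_gain = relative_reduction(baseline_error, new_error)

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1%}")
```

With these placeholder numbers, a drop of about 0.4 absolute points corresponds to a relative improvement of about 4%, which is the kind of figure the 2% to 5% range refers to.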

Abstract

In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters. We conclude with analysis that indicates directions for further improving multilingual ASR.
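To make the representations named in the abstract concrete, here is a minimal sketch of character-level units, byte-level units, and a single BPE-style merge applied to byte sequences. It is a generic toy illustration, not the authors' training recipe: the function names are hypothetical, the two-utterance corpus is invented for the example, and the penalty schemes are not shown because the abstract does not describe them.

```python
from collections import Counter


def char_units(text):
    """Character-level units: one symbol per Unicode code point."""
    return list(text)


def byte_units(text):
    """Byte-level units: UTF-8 bytes, so at most 256 distinct symbols."""
    return list(text.encode("utf-8"))


def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(pair)  # the merged unit is kept as a tuple
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Character-level: Mandarin needs one output per distinct character.
print(len(char_units("你好")))   # 2 symbols
# Byte-level: each of these characters is 3 bytes in UTF-8.
print(len(byte_units("你好")))   # 6 symbols
print(len(byte_units("hello")))  # 5 symbols, one byte per ASCII letter

# One BPE-style merge over bytes (toy corpus, for illustration only).
corpus = [byte_units("你好"), byte_units("你们好")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print([len(seq) for seq in corpus])  # each sequence is shorter after the merge
```

Because English and Mandarin text both reduce to the same 256 byte values, merges learned over bytes can share one output inventory across the two languages, which is consistent with the smaller number of outputs the abstract mentions.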

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing