BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition
Hyunsik Kim, Haeri Kim, Munhak Lee, Kyungmin Lee

TL;DR
This paper introduces BBPE16, a UTF-16-based byte-level byte-pair encoding method that improves multilingual speech recognition by reducing token sequence length and computational load, especially for non-Latin scripts.
Contribution
BBPE16 is a novel UTF-16-based tokenizer that maintains language-agnostic properties while enhancing cross-lingual token sharing and efficiency for multilingual ASR tasks.
Findings
Reduces token counts for Chinese by up to 10.4%
Lowers decoding iterations for Chinese by up to 10.3%
Achieves comparable or better accuracy across monolingual, bilingual, and trilingual ASR and in a multilingual continual-learning setup
Abstract
Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These…
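The encoding-length argument in the abstract can be illustrated with a minimal Python sketch. It only compares raw byte counts under UTF-8 and UTF-16 before any BPE merges are learned, so it does not reproduce the paper's tokenizer or its reported token-count reductions. The helper name `byte_counts`, the sample strings, and the choice of little-endian UTF-16 without a byte-order mark are assumptions made for illustration; the abstract does not specify them.

```python
def byte_counts(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`.

    Assumption: UTF-16 is encoded little-endian with no byte-order mark,
    so every Basic Multilingual Plane character costs exactly 2 bytes.
    """
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    return len(utf8), len(utf16)


# Hypothetical sample strings, chosen only to contrast Latin and CJK scripts.
samples = {
    "English": "hello",
    "Chinese": "你好世界",
    "Korean": "안녕하세요",
    "Japanese": "こんにちは",
}

for name, text in samples.items():
    u8, u16 = byte_counts(text)
    print(f"{name}: {len(text)} chars, UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

In this sketch the CJK strings take 3 bytes per character in UTF-8 and 2 bytes per character in UTF-16, while the ASCII string doubles in size under UTF-16. A byte-level BPE tokenizer starts from these byte sequences, so the base sequence length before merging differs accordingly.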
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing
