BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

Hyunsik Kim; Haeri Kim; Munhak Lee; Kyungmin Lee

arXiv:2602.01717 · cs.CL · February 3, 2026

Open Access

TL;DR

This paper introduces BBPE16, a UTF-16-based byte-level byte-pair encoding method that improves multilingual speech recognition by reducing token sequence length and computational load, especially for non-Latin scripts.

Contribution

BBPE16 is a novel UTF-16-based tokenizer that maintains language-agnostic properties while enhancing cross-lingual token sharing and efficiency for multilingual ASR tasks.

Findings

01 Reduces token counts for Chinese by up to 10.4%

02 Lowers decoding iterations for Chinese by up to 10.3%

03 Achieves comparable or better accuracy across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup

Abstract

Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These…
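
To make the encoding argument in the abstract concrete, the following minimal sketch compares raw byte counts of short strings under UTF-8 and UTF-16. It is an illustration only, not the authors' tokenizer: the sample strings and the helper name `byte_lengths` are hypothetical, the little-endian byte order is an assumption (the abstract does not state which byte order BBPE16 uses), and raw byte counts are only the starting point before BPE merges, so they do not reproduce the token-count figures reported above.

```python
# Illustrative only: compares raw byte counts, not BBPE/BBPE16 token counts.
# The helper name and sample strings are hypothetical, not from the paper.

def byte_lengths(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" omits the byte-order mark that plain "utf-16" prepends.
    # Little-endian is an assumption made for this sketch.
    utf16 = len(text.encode("utf-16-le"))
    return utf8, utf16

samples = {
    "English": "hello",
    "Chinese": "你好",
    "Japanese": "こんにちは",
    "Korean": "안녕하세요",
}

for name, text in samples.items():
    utf8, utf16 = byte_lengths(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")

# Prints:
# English: 5 chars, UTF-8 5 bytes, UTF-16 10 bytes
# Chinese: 2 chars, UTF-8 6 bytes, UTF-16 4 bytes
# Japanese: 5 chars, UTF-8 15 bytes, UTF-16 10 bytes
# Korean: 5 chars, UTF-8 15 bytes, UTF-16 10 bytes
```

Each CJK character here takes 3 bytes in UTF-8 but 2 bytes in UTF-16, which is the uniformity the abstract refers to. The same output shows that ASCII text doubles in size under UTF-16, so any net benefit depends on the merges a tokenizer learns on top of these bytes.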

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing