Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition
Chao Wang, Yuqing Cai, Renzeng Duojie, Jin Zhang, Yutong Liu, Nyima Tashi

TL;DR
This paper introduces a context-aware dynamic chunking approach for streaming Tibetan speech recognition. It adaptively adjusts chunk widths and incorporates linguistically motivated modeling units and an external language model, improving accuracy and reducing latency.
Contribution
It presents an adaptive chunking mechanism, combined with lexicon-based modeling units and an external language model, for streaming Amdo Tibetan speech recognition.
Findings
Achieved 6.23% WER on test set
48.15% relative improvement over fixed-chunk baseline
Reduced recognition latency significantly
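As a reading aid for the figures above, here is a minimal sketch of how a relative improvement is usually computed from two WER values. It assumes the 48.15% figure is a relative WER reduction, (baseline - new) / baseline; the function names are hypothetical and do not come from the paper. Under that assumption, the baseline WER implied by the two reported numbers is about 12%, which is an inference and not a figure stated in this listing.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in WER, as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, relative_reduction: float) -> float:
    """Invert the formula above: baseline = new / (1 - relative reduction)."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative reduction must be in [0, 1)")
    return new_wer / (1.0 - relative_reduction)


# Reported values: 6.23% WER and a 48.15% relative improvement.
# The baseline below is inferred under the stated assumption, not reported.
baseline = implied_baseline_wer(6.23, 0.4815)
print(f"implied fixed-chunk baseline WER: {baseline:.2f}%")
```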
Abstract
In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15%…
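To make the idea of variable chunk widths with cross-chunk context concrete, the following is a minimal sketch of a chunk-wise attention mask of the kind commonly used in streaming encoders. It is not the authors' mechanism: the abstract does not say how chunk widths are derived from encoding states, so the widths are simply passed in, and `chunk_attention_mask` and `left_chunks` are hypothetical names chosen for illustration.

```python
from typing import List


def chunk_attention_mask(chunk_widths: List[int], left_chunks: int = 1) -> List[List[bool]]:
    """Build a frame-level attention mask from variable chunk widths.

    mask[i][j] is True when frame i may attend to frame j, i.e. when j lies
    in the same chunk as i or in one of the `left_chunks` preceding chunks.
    Future chunks are always masked out, which keeps the encoder streamable.
    """
    # Map every frame index to the index of the chunk that contains it.
    frame_to_chunk: List[int] = []
    for chunk_idx, width in enumerate(chunk_widths):
        if width <= 0:
            raise ValueError("chunk widths must be positive")
        frame_to_chunk.extend([chunk_idx] * width)

    n = len(frame_to_chunk)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ci, cj = frame_to_chunk[i], frame_to_chunk[j]
            mask[i][j] = ci - left_chunks <= cj <= ci
    return mask


# Illustrative widths only: three chunks of 2, 3, and 1 frames.
mask = chunk_attention_mask([2, 3, 1], left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With fixed-width chunking every entry of `chunk_widths` would be the same; letting the widths vary per chunk is what gives each frame a flexible receptive field, while `left_chunks` controls how much earlier context crosses chunk boundaries.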
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
