Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Leibny Paola Garcia, Eng Siong Chng

TL;DR
This study examines how the frame rate of a speech tokenizer affects token quality for Mandarin and English, as measured on speech recognition. The effect differs between the two languages, which bears on how a frame rate should be chosen.
Contribution
It analyzes frame rate effects on speech tokenization across two typologically distinct languages, Mandarin and English, a question the authors describe as underexplored.
Findings
Frame rate variations impact speech tokenization differently for Mandarin and English.
The effect reflects an interplay between frame rate, phonetic density, and language-specific acoustic features.
The results inform frame rate selection for speech tokenizers used in speech recognition and related applications.
Abstract
The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech…
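To make the role of frame rate concrete, the following minimal sketch shows how it sets the length of the token sequence and the span of audio each token must summarize. It is not the authors' pipeline. The frame rates (12.5, 25, and 50 Hz), the 16 kHz sample rate, and the 10-second utterance are illustrative assumptions, as are the helper names, and the sketch assumes one token per frame.

```python
import math


def tokens_for_utterance(duration_s: float, frame_rate_hz: float) -> int:
    """Token frames emitted for an utterance, assuming one token per frame."""
    return math.ceil(duration_s * frame_rate_hz)


def frame_span_ms(frame_rate_hz: float) -> float:
    """Milliseconds of audio covered by a single token frame."""
    return 1000.0 / frame_rate_hz


def hop_samples(sample_rate_hz: int, frame_rate_hz: float) -> float:
    """Waveform samples advanced per token frame."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    SAMPLE_RATE_HZ = 16_000   # assumed, not taken from the paper
    DURATION_S = 10.0         # assumed utterance length
    for rate in (12.5, 25.0, 50.0):  # assumed frame rates
        print(
            f"{rate:>5} Hz: "
            f"{tokens_for_utterance(DURATION_S, rate)} tokens, "
            f"{frame_span_ms(rate):.0f} ms per token, "
            f"{hop_samples(SAMPLE_RATE_HZ, rate):.0f} samples per hop"
        )
```

Halving the frame rate halves the sequence a language model has to process but doubles the audio each token covers. This is why the number of phonetic units falling within one frame, and hence the language, can matter.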
