NAST: Noise Aware Speech Tokenization for Speech Language Models
Shoval Messica, Yossi Adi

TL;DR
NAST introduces a noise-aware speech tokenization method that enhances speech language models by improving robustness and disentanglement in noisy and varied acoustic conditions.
Contribution
The paper presents NAST, a noise-aware speech tokenization framework built from three components (a predictor, a residual encoder, and a decoder) that outperforms the evaluated baselines on spoken language modeling tasks.
Findings
NAST outperforms the evaluated baselines across all setups on several spoken language modeling tasks.
NAST demonstrates robustness to noise, reverberation, pitch-shift, and time-stretch.
NAST exhibits effective disentanglement of speech representations.
Abstract
Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can later be used for various downstream tasks, including automatic speech recognition, text-to-speech, etc. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization under the noisy setup and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the efficiency of NAST considering several spoken language modeling tasks and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and robustness to signal variations in the form of noise, reverberation, pitch-shift, and…
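As a rough illustration of what "representing speech signals as a sequence of discrete units" can mean, the sketch below maps feature frames to the index of their nearest codebook vector and measures how many units change when the frames are perturbed. This is a generic nearest-centroid quantizer, not NAST's predictor, residual encoder, or decoder; the 2-D features, the codebook, the noise level, and the function names `tokenize` and `unit_mismatch_rate` are all hypothetical and chosen only for illustration.

```python
import random


def tokenize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    units = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in codebook
        ]
        units.append(distances.index(min(distances)))
    return units


def unit_mismatch_rate(units_a, units_b):
    """Fraction of frame positions whose discrete unit differs."""
    assert len(units_a) == len(units_b)
    return sum(a != b for a, b in zip(units_a, units_b)) / len(units_a)


# Hypothetical 2-D "features" and a 3-entry codebook (illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clean = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.8), (0.1, 0.9)]

# Simulate a signal variation by adding Gaussian noise to every feature.
rng = random.Random(0)
noisy = [(x + rng.gauss(0, 0.3), y + rng.gauss(0, 0.3)) for x, y in clean]

clean_units = tokenize(clean, codebook)
noisy_units = tokenize(noisy, codebook)

print("clean units:", clean_units)
print("noisy units:", noisy_units)
print("mismatch rate:", unit_mismatch_rate(clean_units, noisy_units))
```

A tokenizer that is robust in the sense described in the abstract would keep this kind of mismatch low when the input is corrupted by noise, reverberation, pitch-shift, and similar variations; the mismatch measure used here is an assumed stand-in, not the paper's evaluation metric.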
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
