NAST: Noise Aware Speech Tokenization for Speech Language Models

Shoval Messica; Yossi Adi

arXiv:2406.11037·cs.SD·June 18, 2024

TL;DR

NAST is a noise-aware speech tokenization method for speech language models. Its discrete units are designed to stay robust under noise, reverberation, pitch-shift, and time-stretch, and its representations show disentanglement properties.

Contribution

The paper presents a noise-aware speech tokenization framework built from three components (a predictor, a residual encoder, and a decoder) and reports that it outperforms the evaluated baselines on several spoken language modeling tasks.

Findings

01 NAST outperforms the evaluated baselines across several spoken language modeling tasks.

02 NAST is robust to noise, reverberation, pitch-shift, and time-stretch.

03 NAST exhibits disentanglement in its speech representations.

Abstract

Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can later be used for various downstream tasks, including automatic speech recognition and text-to-speech. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization under the noisy setup and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the efficiency of NAST considering several spoken language modeling tasks and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and robustness to signal variations in the form of noise, reverberation, pitch-shift, and…
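
To make "a sequence of discrete units" and "robustness to signal variations" concrete, here is a minimal sketch. It is not NAST's method: it assumes a toy tokenizer that maps each feature frame to its nearest vector in a fixed codebook, and it measures robustness as the fraction of frame-aligned units that change when noise is added. The names `nearest_unit`, `tokenize`, and `unit_mismatch_rate`, the two-dimensional frames, and the four-entry codebook are all hypothetical choices made for illustration.

```python
import random


def nearest_unit(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(codebook):
        # Squared Euclidean distance between the frame and this codebook entry.
        dist = sum((a - b) ** 2 for a, b in zip(frame, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(frames, codebook):
    """Map a sequence of feature frames to a sequence of discrete unit ids."""
    return [nearest_unit(frame, codebook) for frame in frames]


def unit_mismatch_rate(units_a, units_b):
    """Fraction of positions where two equal-length unit sequences differ."""
    if len(units_a) != len(units_b):
        raise ValueError("unit sequences must have the same length")
    if not units_a:
        return 0.0
    return sum(a != b for a, b in zip(units_a, units_b)) / len(units_a)


# Hypothetical toy data: 2-D "features" and a 4-entry codebook.
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clean_frames = [[rng.random(), rng.random()] for _ in range(200)]
# Additive Gaussian noise stands in for an acoustic perturbation.
noisy_frames = [[x + rng.gauss(0.0, 0.1) for x in frame] for frame in clean_frames]

clean_units = tokenize(clean_frames, codebook)
noisy_units = tokenize(noisy_frames, codebook)
mismatch = unit_mismatch_rate(clean_units, noisy_units)
print(f"units changed by noise: {mismatch:.1%}")
```

A lower mismatch rate means the tokenizer assigns the same units to clean and perturbed input. The frame-aligned comparison only makes sense for perturbations that preserve length, so something like time-stretch would need a sequence-level distance instead.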

Code & Models

Repositories

ShovalMessica/NAST (PyTorch, official)

Models

shovalmessica/NAST (Hugging Face model)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems