STAB: Speech Tokenizer Assessment Benchmark

Shikhar Vashishth; Harman Singh; Shikhar Bharadwaj; Sriram Ganapathy; Chulayuth Asawaroengchai; Kartik Audhkhasi; Andrew Rosenberg; Ankur Bapna; Bhuvana Ramabhadran

arXiv:2409.02384 · cs.CL · September 5, 2024

Open Access

TL;DR

STAB is a comprehensive benchmark framework designed to evaluate speech tokenizers systematically, helping researchers understand their properties and improve their performance across various speech tasks.

Contribution

The paper introduces STAB, a standardized evaluation framework for speech tokenizers. It addresses the lack of systematic assessment methods and makes it easier to compare tokenizers and understand their properties.

Findings

01 STAB metrics correlate with downstream task performance.

02 The framework reveals key tokenizer properties affecting task outcomes.

03 Evaluation across multiple tasks demonstrates the utility of STAB.
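
One way to check a claim like finding 01 is to compute a rank correlation between a benchmark metric and a downstream score across several tokenizers. The sketch below is illustrative only: the function names `average_ranks` and `spearman` are hypothetical, the choice of Spearman correlation is an assumption (this page does not say which correlation measure the paper uses), and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for illustration only (one entry per tokenizer).
benchmark_metric = [0.2, 0.5, 0.7, 0.9]   # hypothetical metric, higher is better
downstream_score = [0.4, 0.3, 0.6, 0.8]   # hypothetical task score, higher is better
rho = spearman(benchmark_metric, downstream_score)
```

A value of `rho` near 1 would mean the metric orders the tokenizers the same way the downstream task does. A rank-based measure is used here because it only depends on that ordering, not on the scale of either quantity.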

Abstract

Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech…

Taxonomy

Topics: Speech Recognition and Synthesis