BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Anup Singh; Kris Demuynck; Vipul Arora

arXiv:2411.14100·eess.AS·December 24, 2024

TL;DR

This paper introduces a novel speech tokenization method that encodes speech into discrete, speaker-agnostic tokens, enabling faster and more accurate spoken term detection, especially for out-of-vocabulary terms.

Contribution

It proposes a bidirectional Mamba-enhanced self-supervised framework for generating consistent, speaker-invariant speech tokens for improved STD performance.
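
As a rough illustration of what "bidirectional" means for a state space layer, the sketch below runs a fixed linear recurrence over a sequence in both directions and pairs the two states per frame. This is not the paper's encoder: Mamba uses input-dependent (selective) parameters and vector-valued states, and how the paper combines the two directions is not stated here. The names `scan` and `bidirectional_scan`, the scalar features, and the constants `a` and `b` are all assumptions made for illustration.

```python
def scan(xs, a=0.5, b=1.0):
    """Run the linear recurrence h_t = a * h_{t-1} + b * x_t over xs (h_0 = 0)."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.5, b=1.0):
    """Pair a left-to-right state with a right-to-left state for every frame.

    Pairing (rather than summing or gating) is an illustrative choice,
    not something taken from the paper.
    """
    forward = scan(xs, a, b)
    # Scan the reversed sequence, then flip back so index t lines up again.
    backward = scan(xs[::-1], a, b)[::-1]
    return list(zip(forward, backward))


if __name__ == "__main__":
    # An impulse at the first frame: only the forward state carries it to later frames.
    print(bidirectional_scan([1.0, 0.0, 0.0]))
    # An impulse at the last frame: only the backward state carries it to earlier frames.
    print(bidirectional_scan([0.0, 0.0, 1.0]))
```

Each frame's pair therefore summarizes both what came before it and what comes after it, which is the sense in which a bidirectional encoder sees context on both sides.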

Findings

01. Outperforms existing STD baselines on LibriSpeech and TIMIT
02. Produces more speaker-invariant speech tokens
03. Enables fast, text-based retrieval for spoken terms

Abstract

Spoken term detection (STD) is often hindered by reliance on frame-level features and computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks.…
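
To make the "text-based search" idea concrete, here is a minimal sketch of retrieval over discrete token sequences using an inverted index of token bigrams, ranked by Jaccard similarity. It assumes utterances and queries have already been tokenized into integer IDs; the token values, utterance names, and the functions `ngrams`, `build_index`, and `search` are hypothetical, and the indexing and scoring scheme is a generic one chosen for illustration, not necessarily the one used in the paper.

```python
from collections import defaultdict


def ngrams(tokens, n=2):
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(utterances, n=2):
    """Map each n-gram to the set of utterance IDs that contain it."""
    index = defaultdict(set)
    for utt_id, tokens in utterances.items():
        for gram in ngrams(tokens, n):
            index[gram].add(utt_id)
    return index


def search(query_tokens, utterances, index, n=2):
    """Rank utterances sharing at least one n-gram with the query.

    Scores are Jaccard similarities between n-gram sets, highest first;
    ties are broken by utterance ID.
    """
    query_grams = ngrams(query_tokens, n)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    scored = []
    for utt_id in candidates:
        utt_grams = ngrams(utterances[utt_id], n)
        score = len(query_grams & utt_grams) / len(query_grams | utt_grams)
        scored.append((score, utt_id))
    return sorted(scored, key=lambda item: (-item[0], item[1]))


if __name__ == "__main__":
    # Hypothetical token sequences; real ones would come from a speech tokenizer.
    utterances = {
        "utt_a": [5, 12, 12, 7, 3, 9],
        "utt_b": [1, 4, 8, 2],
        "utt_c": [12, 7, 3, 6],
    }
    index = build_index(utterances)
    for score, utt_id in search([12, 7, 3], utterances, index):
        print(utt_id, round(score, 3))
```

Because candidates are found by index lookup rather than by aligning the query against every utterance frame by frame, the cost depends on how many utterances share n-grams with the query, not on the total audio length. The approach only works when different speakers' renditions of a term map to the same or nearly the same token sequence, which is why token consistency and speaker invariance matter for this task.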

Code & Models

Repositories

anupsingh15/BEST-STD (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Spatial-Channel Token Distillation