BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora

TL;DR
This paper introduces a novel speech tokenization method that encodes speech into discrete, speaker-agnostic tokens, enabling faster and more accurate spoken term detection, especially for out-of-vocabulary terms.
Contribution
It proposes a bidirectional Mamba-enhanced self-supervised framework for generating consistent, speaker-invariant speech tokens for improved STD performance.
Findings
Outperforms existing STD baselines on LibriSpeech and TIMIT
Produces more speaker-invariant speech tokens
Enables fast, text-based retrieval for spoken terms
Abstract
Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
MethodsMamba: Linear-Time Sequence Modeling with Selective State Spaces · Spatial-Channel Token Distillation
