BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Anup Singh, Vipul Arora, Kris Demuynck

TL;DR
This paper introduces BEST-STD2.0, a speech tokenizer optimized for spoken term detection that improves robustness to noise, balances token usage, and enhances retrieval efficiency through novel training and regularization techniques.
Contribution
The paper presents a speech tokenizer trained with noise and reverberation augmentation for robustness, regularized with optimal transport for balanced token usage, and paired with TF-IDF search for fast retrieval in spoken term detection.
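To make "balanced token usage" concrete, here is a minimal sketch of one common way to get balanced assignments with optimal transport: Sinkhorn-Knopp normalization of frame-to-token scores. This summary does not say which optimal transport formulation the paper uses, so the function name `sinkhorn_balanced_assignment`, the temperature `epsilon`, and the iteration count are illustrative assumptions, not the authors' method.

```python
import numpy as np


def sinkhorn_balanced_assignment(logits, epsilon=0.05, n_iters=50):
    """Hypothetical helper: soft-assign frames to tokens with balanced usage.

    logits: array of shape (num_frames, num_tokens) holding similarity scores.
    Returns an array of the same shape whose rows sum to 1 and whose columns
    carry approximately equal total mass (num_frames / num_tokens each).
    """
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((logits - logits.max()) / epsilon)
    q /= q.sum()
    num_frames, num_tokens = q.shape
    for _ in range(n_iters):
        # Scale columns so each token receives total mass 1 / num_tokens.
        q /= q.sum(axis=0, keepdims=True) * num_tokens
        # Scale rows so each frame carries total mass 1 / num_frames.
        q /= q.sum(axis=1, keepdims=True) * num_frames
    # Rescale so every row is a probability distribution over tokens.
    return q * num_frames
```

In a training loop, targets like these could be used in place of plain softmax assignments so that no single token absorbs most frames. That usage is an assumption about how such a regularizer is typically applied, not a description of the paper's loss.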
Findings
Outperforms baseline STD systems across various noise conditions
Improves token utilization and robustness to reverberation
Maintains high retrieval efficiency
Abstract
Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
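The TF-IDF search idea can be sketched by treating each utterance's token sequence as a document and its token n-grams as terms. The abstract gives no details of the paper's indexing scheme, so the bigram terms, the smoothed IDF formula, the cosine scoring, and the names `build_index` and `search` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n=2):
    """docs: dict mapping utterance id -> list of integer token ids.

    Returns (idf, vectors), where vectors holds L2-normalized TF-IDF
    weights for every utterance.
    """
    tf = {doc_id: Counter(ngrams(toks, n)) for doc_id, toks in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each utterance counts once per term
    num_docs = len(docs)
    # Smoothed IDF (an assumption; other variants would work too).
    idf = {t: math.log((1 + num_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {}
    for doc_id, counts in tf.items():
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[doc_id] = {t: v / norm for t, v in vec.items()}
    return idf, vectors


def search(query_tokens, idf, vectors, n=2):
    """Rank utterances by cosine similarity to the tokenized query."""
    counts = Counter(ngrams(query_tokens, n))
    # Terms never seen in the database are ignored.
    q = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in q.values())) or 1.0
    scores = {
        doc_id: sum(w * vec.get(t, 0.0) for t, w in q.items()) / norm
        for doc_id, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A query would first be tokenized by the speech tokenizer, then passed to `search` together with the index built once over the database. Because scoring only involves sparse term weights, no frame-level alignment is needed at query time, which is the kind of efficiency gain the abstract attributes to token-based search.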
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
