BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Anup Singh; Vipul Arora; Kris Demuynck

arXiv:2512.16395·eess.AS·February 19, 2026

TL;DR

This paper introduces BEST-STD2.0, a speech tokenizer optimized for spoken term detection that improves robustness to noise, balances token usage, and enhances retrieval efficiency through novel training and regularization techniques.

Contribution

The paper presents a speech tokenizer for spoken term detection that is trained with noise and reverberation augmentation for robustness, regularized with optimal transport for balanced token usage, and paired with a TF-IDF-based search for fast retrieval.
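To make the idea of optimal-transport balancing concrete, here is a minimal sketch using Sinkhorn-Knopp iterations, a common way to solve entropy-regularized optimal transport with uniform marginals. This page does not give the paper's exact formulation, so the function names (`sinkhorn_balanced_assignment`, `hard_tokens`), the parameters `epsilon` and `n_iters`, and the toy scores are all assumptions made for illustration.

```python
import math

def sinkhorn_balanced_assignment(scores, epsilon=0.5, n_iters=100):
    """Turn frame-to-token scores into a balanced soft assignment.

    scores: list of rows, one per frame, holding a similarity score for
            each codebook token (hypothetical input, e.g. cosine scores).
    Returns a matrix whose rows sum to 1 (one unit of mass per frame) and
    whose columns sum to roughly n_frames / n_tokens (equal token usage).
    Assumes score differences are small relative to epsilon so that the
    exponentials below do not underflow to zero.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)
    # Gibbs kernel of the entropic OT problem, shifted for numerical safety.
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every token the same total mass n / k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / col[j] * (n / k) for j in range(k)] for i in range(n)]
        # Row step: every frame distributes exactly one unit of mass.
        q = [[v / sum(row) for v in row] for row in q]
    return q

def hard_tokens(matrix):
    """Index of the largest entry in each row (ties go to the lowest index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

# Toy example: every frame prefers token 0 before balancing.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.9, 0.0, 0.2],
    [0.7, 0.3, 0.1],
    [0.8, 0.1, 0.3],
    [0.9, 0.2, 0.0],
]
balanced = sinkhorn_balanced_assignment(toy_scores)
print("before:", hard_tokens(toy_scores))
print("after: ", hard_tokens(balanced))
```

In a training setup of this kind, the balanced matrix would typically serve as a target that discourages the tokenizer from collapsing onto a few tokens. How the paper turns the transport plan into a regularization term is not stated here.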

Findings

01. Outperforms baseline STD systems across various noise conditions

02. Improves token utilization and robustness to reverberation

03. Maintains high retrieval efficiency

Abstract

Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but are not robust to noise and reverberation and use their tokens inefficiently. We address these challenges by proposing a noise- and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
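As an illustration of how a TF-IDF search over tokenized speech could work, the sketch below indexes each utterance by its token n-grams and ranks utterances by cosine similarity to a tokenized query. It is not the authors' implementation: the class `TfidfIndex`, the choice of bigrams, the smoothed IDF formula, and the toy token sequences are assumptions, and the sketch scores whole utterances rather than localizing the matching segment.

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, n=2):
    """Return the token n-grams (as tuples) of a token-ID sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class TfidfIndex:
    """Inverted index over token n-grams with TF-IDF cosine scoring."""

    def __init__(self, docs, n=2):
        self.n = n
        counts = [Counter(ngrams(d, n)) for d in docs]
        # Document frequency: in how many utterances each n-gram occurs.
        df = Counter()
        for c in counts:
            df.update(c.keys())
        num_docs = len(docs)
        # Smoothed IDF (an assumed variant; other weightings are possible).
        self.idf = {t: math.log((1 + num_docs) / (1 + f)) + 1.0
                    for t, f in df.items()}
        # Postings: n-gram -> list of (doc_id, L2-normalized weight).
        self.postings = defaultdict(list)
        for doc_id, c in enumerate(counts):
            vec = {t: tf * self.idf[t] for t, tf in c.items()}
            norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
            for t, w in vec.items():
                self.postings[t].append((doc_id, w / norm))

    def search(self, query_tokens, top_k=3):
        """Return up to top_k (doc_id, cosine score) pairs, best first."""
        q = Counter(ngrams(query_tokens, self.n))
        # n-grams never seen at indexing time contribute nothing.
        qvec = {t: tf * self.idf[t] for t, tf in q.items() if t in self.idf}
        qnorm = math.sqrt(sum(w * w for w in qvec.values())) or 1.0
        scores = defaultdict(float)
        # Only utterances sharing an n-gram with the query are touched.
        for t, w in qvec.items():
            for doc_id, dw in self.postings[t]:
                scores[doc_id] += (w / qnorm) * dw
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy database of tokenized utterances (hypothetical token IDs).
database = [
    [5, 5, 9, 2, 7, 7, 1],
    [3, 3, 8, 8, 4],
    [9, 2, 7, 6, 6, 0],
]
index = TfidfIndex(database, n=2)
for doc_id, score in index.search([9, 2, 7]):
    print(doc_id, round(score, 3))
```

In this toy run only utterances 0 and 2 share bigrams with the query, so utterance 1 is never scored. Skipping non-matching entries through the inverted index is what makes this style of search fast on large databases.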

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing