H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing
Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora

TL;DR
H-QuEST is a hierarchical indexing framework that speeds up query-by-example spoken term detection by combining TF-IDF-based sparse representations of learned audio features with HNSW indexing, without sacrificing accuracy relative to existing methods.
Contribution
The paper presents a hierarchical indexing approach that combines TF-IDF-based sparse representations with Hierarchical Navigable Small World (HNSW) indexing and a further refinement step, accelerating spoken term detection without sacrificing accuracy.
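As a rough illustration of the TF-IDF side of this idea, the sketch below ranks audio documents by cosine similarity between sparse TF-IDF vectors. It assumes the audio has already been converted into sequences of discrete token IDs (for example, by clustering learned frame features); that tokenization, the smoothed IDF formula, and the function names `build_idf`, `tfidf_vector`, and `rank` are illustrative assumptions, not details taken from the paper. The brute-force scan over all documents stands in for the HNSW index, which would replace it with an approximate nearest-neighbor search.

```python
import math
from collections import Counter

def build_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalised sparse TF-IDF vector stored as a {token: weight} dict."""
    tf = Counter(t for t in tokens if t in idf)  # ignore unseen tokens
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def rank(query_tokens, docs, idf, top_k=2):
    """Brute-force ranking (assumption: stands in for an HNSW lookup)."""
    q = tfidf_vector(query_tokens, idf)
    scored = [(cosine(q, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:top_k]]

# Hypothetical token sequences; real ones would come from an audio tokenizer.
docs = [[3, 3, 7, 1], [5, 2, 2, 9], [7, 1, 4, 4]]
idf = build_idf(docs)
top = rank([3, 7, 1], docs, idf)
```

Because the vectors are sparse and normalised, the similarity reduces to a dot product over shared tokens, which is what makes this representation cheap to compare and suitable for a graph-based index.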
Findings
Substantial speed improvements in retrieval time.
Maintains high accuracy comparable to existing methods.
Effective scalability for large audio datasets.
Abstract
Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods like dynamic time warping (DTW), which are computationally expensive and do not scale well. To address this, we propose H-QuEST (Hierarchical Query-by-Example Spoken Term Detection), a novel framework that accelerates spoken term retrieval by utilizing Term Frequency and Inverse Document Frequency (TF-IDF)-based sparse representations obtained through advanced audio representation learning techniques and Hierarchical Navigable Small World (HNSW) indexing with further refinement. Experimental results show that H-QuEST delivers substantial improvements in retrieval speed without sacrificing accuracy compared to existing methods.
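To make the cost of template matching concrete, here is a minimal sketch of dynamic time warping between a query and a candidate segment, each given as a list of feature frames. The quadratic loop over every pair of frames is the reason DTW scales poorly when applied to a whole corpus. The `rerank` helper shows one plausible way a fast index could be paired with DTW, by applying it only to a short candidate list; this is an assumption for illustration, since the abstract does not specify what the refinement step is. The names `dtw_distance` and `rerank` are hypothetical.

```python
import math

def dtw_distance(query, segment):
    """DTW alignment cost between two sequences of feature frames.

    Each frame is a list of floats. Runs in O(len(query) * len(segment)) time
    and keeps only two rows of the cost table in memory.
    """
    n, m = len(query), len(segment)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = math.dist(query[i - 1], segment[j - 1])  # Euclidean frame distance
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]

def rerank(query, candidates):
    """Sort candidate IDs by DTW cost (assumed refinement over a shortlist)."""
    return sorted(candidates, key=lambda cid: dtw_distance(query, candidates[cid]))

# Toy one-dimensional frames; real frames would be learned audio features.
query = [[0.0], [1.0], [2.0]]
candidates = {
    "stretched": [[0.0], [1.0], [1.0], [2.0]],  # same shape, spoken more slowly
    "different": [[2.0], [2.0], [0.0]],
}
order = rerank(query, candidates)
```

Restricting DTW to a handful of candidates keeps the expensive alignment off the critical path while still letting it decide the final ordering.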
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Time Series Analysis and Forecasting
Methods: TF-IDF · Hierarchical Navigable Small World (HNSW) indexing · Dynamic Time Warping (DTW)
