H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing

Akanksha Singh; Yi-Ping Phoebe Chen; Vipul Arora

arXiv:2506.16751 · eess.AS · June 23, 2025 · Interspeech

TL;DR

H-QuEST is a hierarchical indexing framework that speeds up query-by-example spoken term detection by combining TF-IDF-based sparse representations of learned audio features with HNSW indexing and a refinement step, without sacrificing accuracy relative to existing methods.

Contribution

The paper presents a hierarchical indexing approach that combines TF-IDF-based sparse representations with HNSW indexing to accelerate spoken term detection without sacrificing accuracy compared to existing methods.

Findings

1. Substantial speed improvements in retrieval time.
2. Maintains high accuracy comparable to existing methods.
3. Effective scalability for large audio datasets.

Abstract

Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods like dynamic time warping (DTW), which are computationally expensive and do not scale well. To address this, we propose H-QuEST (Hierarchical Query-by-Example Spoken Term Detection), a novel framework that accelerates spoken term retrieval by utilizing Term Frequency and Inverse Document Frequency (TF-IDF)-based sparse representations obtained through advanced audio representation learning techniques and Hierarchical Navigable Small World (HNSW) indexing with further refinement. Experimental results show that H-QuEST delivers substantial improvements in retrieval speed without sacrificing accuracy compared to existing methods.
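The coarse-to-fine idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation, and it rests on several assumptions: each utterance is assumed to be already converted to a sequence of discrete unit IDs (the abstract does not say how), the function names `tfidf_vectors`, `embed`, `cosine`, `dtw_cost`, and `search` are hypothetical, an exhaustive cosine scan stands in for the HNSW index, and whole-sequence DTW stands in for the unspecified refinement step.

```python
import math
from collections import Counter


def embed(tokens, idf):
    """Map a token sequence to an L2-normalised sparse TF-IDF vector (dict)."""
    tf = Counter(tokens)
    vec = {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Fit smoothed IDF weights on the corpus and embed every document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [embed(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def dtw_cost(x, y):
    """Whole-sequence DTW with a 0/1 local cost; O(len(x) * len(y)) time."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for i in range(1, len(x) + 1):
        cur = [inf] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            d = 0.0 if x[i - 1] == y[j - 1] else 1.0
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(y)]


def search(query, docs, top_k=2):
    """Shortlist by TF-IDF cosine, then re-rank the shortlist by DTW cost."""
    vectors, idf = tfidf_vectors(docs)
    q = embed(query, idf)
    # Assumption: an exhaustive scan replaces the HNSW nearest-neighbour lookup.
    shortlist = sorted(
        range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True
    )[:top_k]
    # Assumption: DTW is used here only as an example of a refinement step.
    return sorted(shortlist, key=lambda i: dtw_cost(query, docs[i]))


# Toy corpus of hypothetical unit-ID sequences, one list per utterance.
docs = [[3, 3, 7, 7, 1], [5, 5, 2, 9], [3, 7, 1, 1, 4]]
query = [3, 7, 7, 1]
print(search(query, docs, top_k=2))
```

The point of the split is cost: the quadratic-time DTW comparison runs only on the `top_k` shortlisted utterances instead of the whole corpus, which is the kind of saving an approximate index such as HNSW is meant to provide at scale.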

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Time Series Analysis and Forecasting

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings