MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting
Youngmoon Jung, Myunghun Jung, Joon-Young Yang, Yong-Hyeok Lee, Jaeyoung Roh, Hoon-Young Cho

TL;DR
MATE introduces nested audio-text embeddings with multiple granularities for open-vocabulary keyword spotting, achieving state-of-the-art results without additional inference costs.
Contribution
The paper proposes a matryoshka-style audio-text embedding framework with PCA-guided prefix alignment for open-vocabulary keyword spotting.
Findings
Achieves state-of-the-art results on WSJ and LibriPhrase datasets.
Introduces nested embeddings capturing multiple granularities.
No additional inference overhead required.
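To illustrate why nested embeddings need no extra inference pass, here is a minimal sketch, not the authors' implementation. It assumes cosine similarity as the matching score and uses hypothetical prefix sizes (16, 32, 64, 128); the summary above states neither. Each prefix is simply the leading slice of a full embedding that has already been computed.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def prefix_scores(audio_emb, text_emb, prefix_sizes):
    """Cosine similarity between paired audio/text embeddings at each prefix size.

    A prefix is the first d coordinates of the full vector, so every
    granularity is read off one encoder output without re-encoding.
    """
    scores = {}
    for d in prefix_sizes:
        a = l2_normalize(audio_emb[..., :d])
        t = l2_normalize(text_emb[..., :d])
        scores[d] = np.sum(a * t, axis=-1)
    return scores

# Toy data standing in for encoder outputs (assumption: 128-dim embeddings).
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 128))
text = audio + 0.1 * rng.normal(size=(4, 128))  # matching pairs with small noise
scores = prefix_scores(audio, text, prefix_sizes=(16, 32, 64, 128))
```

In this sketch the only per-prefix cost is a slice and a dot product, which is negligible next to running the encoders.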
Abstract
Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our…
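The PCA-guided prefix alignment described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: PCA is fitted per batch via SVD, the alignment penalty is a mean squared error, and the function names `pca_targets` and `prefix_alignment_loss` are hypothetical. The abstract does not specify how the PCA is fitted or which distance is used.

```python
import numpy as np

def pca_targets(text_emb, prefix_sizes):
    """Project full text embeddings onto their top-d principal directions.

    Returns {d: array of shape (batch, d)}, one teacher target per prefix size.
    Assumption: PCA is fitted on the current batch of full text embeddings.
    """
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return {d: centered @ vt[:d].T for d in prefix_sizes}

def prefix_alignment_loss(emb, targets):
    """Mean squared error between each prefix emb[:, :d] and its PCA target.

    Assumption: MSE is used here only for illustration.
    """
    losses = [np.mean((emb[:, :d] - tgt) ** 2) for d, tgt in targets.items()]
    return sum(losses) / len(losses)

# Toy batch standing in for encoder outputs (assumption: 64-dim embeddings).
rng = np.random.default_rng(0)
text = rng.normal(size=(256, 64))
audio = rng.normal(size=(256, 64))

targets = pca_targets(text, prefix_sizes=(8, 16, 32))
loss_text = prefix_alignment_loss(text, targets)    # aligns text prefixes
loss_audio = prefix_alignment_loss(audio, targets)  # aligns audio prefixes
```

Because every target is built from the same ordered principal directions, the targets are themselves nested: the first 8 columns of the 16-dimensional target equal the 8-dimensional target. That ordering is what pushes the highest-variance information into the shortest prefixes.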
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Advanced Text Analysis Techniques
