MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting

Youngmoon Jung; Myunghun Jung; Joon-Young Yang; Yong-Hyeok Lee; Jaeyoung Roh; Hoon-Young Cho

arXiv:2601.14012 · eess.AS · January 21, 2026

TL;DR

MATE introduces nested audio-text embeddings with multiple granularities for open-vocabulary keyword spotting, achieving state-of-the-art results without additional inference costs.

Contribution

The paper proposes a Matryoshka-style dual-encoder embedding framework with PCA-guided prefix alignment for open-vocabulary keyword spotting.

Findings

01 Achieves state-of-the-art results on the WSJ and LibriPhrase datasets.

02 Introduces nested embeddings that capture multiple granularities within a single vector.

03 Requires no additional inference overhead.

Abstract

Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our…
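
The PCA-guided prefix alignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pca_teacher_targets` and `prefix_alignment_loss`, the mean-squared-error alignment term, the prefix sizes, and fitting PCA on a single random batch are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def pca_teacher_targets(text_emb, prefix_sizes):
    """Project full text embeddings onto their top-k principal components
    for each prefix size k (hypothetical helper, not from the paper)."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return {k: centered @ vt[:k].T for k in prefix_sizes}

def prefix_alignment_loss(emb, targets):
    """Average MSE between each embedding prefix emb[:, :k] and its
    PCA teacher target. MSE is an assumed choice of alignment loss."""
    losses = [np.mean((emb[:, :k] - t) ** 2) for k, t in targets.items()]
    return float(np.mean(losses))

# Toy data standing in for encoder outputs (assumed shapes: batch 32, dim 16).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32, 16))
audio_emb = rng.normal(size=(32, 16))

prefix_sizes = [4, 8, 16]  # assumed nested prefix sizes
targets = pca_teacher_targets(text_emb, prefix_sizes)

# Both the text prefixes and the audio prefixes are pulled toward the targets.
text_loss = prefix_alignment_loss(text_emb, targets)
audio_loss = prefix_alignment_loss(audio_emb, targets)
```

Because every target is built from the same ordered principal directions, the target for a smaller prefix is itself a prefix of the target for a larger one, which mirrors the nested structure the abstract describes. In a real training setup the teacher targets would presumably be treated as fixed (no gradient) and this term would be added to the metric learning objective.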

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Advanced Text Analysis Techniques