MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

Junming Yuan; Ying Shi; Dong Wang; Lantian Li; Askar Hamdulla

arXiv:2511.06296 · cs.SD · November 11, 2025

Open Access

TL;DR

MT-HuBERT introduces a self-supervised pre-training framework for few-shot keyword spotting in mixed speech, effectively handling overlapping keywords and outperforming state-of-the-art methods in both mixed and clean conditions.

Contribution

This work presents MT-HuBERT, a novel self-supervised Mix-Training approach that leverages unlabeled data for improved mixed speech keyword spotting.

Findings

01 Outperforms state-of-the-art baselines on few-shot KWS tasks

02 Effective in both mixed and clean speech conditions

03 Makes use of unlabeled data during pre-training, which the earlier fully supervised Mix-Training approach could not

Abstract

Few-shot keyword spotting aims to detect previously unseen keywords with very limited labeled samples. A pre-training and adaptation paradigm is typically adopted for this task. While effective in clean conditions, most existing approaches struggle with mixed keyword spotting (detecting multiple overlapping keywords within a single utterance), a capability essential for real-world applications. We have previously proposed a pre-training approach based on Mix-Training (MT) to tackle the mixed keyword detection problem and demonstrated its efficiency. However, this approach is fully supervised and therefore unable to utilize vast unlabeled data. To this end, we propose Mix-Training HuBERT (MT-HuBERT), a self-supervised learning (SSL) pre-training framework that implements the MT criterion during pre-training. MT-HuBERT predicts, in a self-supervised manner, the clean acoustic units of each constituent…
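
The abstract is cut off above, so the exact training objective is not shown here. As a rough illustration only, the idea of predicting the clean acoustic units of each constituent of a mixture can be sketched as follows. This is a minimal sketch under assumptions, not the authors' implementation: it assumes each clean utterance already has frame-level discrete unit IDs (for example, cluster indices as in HuBERT), that two utterances are mixed by simple addition, and that the target for each frame is a multi-hot vector marking the units of both sources. The function names `mix_waveforms` and `multi_hot_targets` and the parameter `gain_b` are hypothetical.

```python
def mix_waveforms(a, b, gain_b=1.0):
    """Sum two waveforms sample by sample, zero-padding the shorter one.

    Assumption: the mixture is a plain additive overlap; gain_b is a
    hypothetical scaling factor for the second source.
    """
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return [x + gain_b * y for x, y in zip(a, b)]


def multi_hot_targets(units_a, units_b, num_units):
    """Build one multi-hot row per frame marking the unit of each source.

    Assumption: units_a and units_b are frame-aligned unit IDs derived
    from the clean (unmixed) utterances. Frames where only one source is
    active carry a single 1.
    """
    n = max(len(units_a), len(units_b))
    targets = []
    for t in range(n):
        row = [0] * num_units
        if t < len(units_a):
            row[units_a[t]] = 1
        if t < len(units_b):
            row[units_b[t]] = 1
        targets.append(row)
    return targets


if __name__ == "__main__":
    mixed = mix_waveforms([1.0, 2.0], [0.5])
    targets = multi_hot_targets([0, 2], [1], num_units=3)
    print(mixed)
    print(targets)
```

In this sketch a model would receive `mixed` as input and be trained to score every unit that is active in each frame, which is one plausible reading of applying a Mix-Training criterion with self-supervised targets.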

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling