LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Pai Zhu; Quan Wang; Dhruuv Agarwal; Kurt Partridge

arXiv:2505.22995 · eess.AS · February 6, 2026

TL;DR

This paper presents LLM-Synth4KWS, a scalable method for generating confusable training data for custom keyword spotting using large language models and text-to-speech synthesis, improving model robustness.

Contribution

It introduces a novel data augmentation approach leveraging LLMs and TTS to generate confusable utterances, enhancing keyword spotting performance.

Findings

01. AUC improved by 3.7% on the Speech Commands dataset

02. c-AUC (the confusable-group metric) improved by 11.3%

03. Method offers scalable data augmentation at zero labor cost

Abstract

Custom keyword spotting (KWS) allows detecting user-defined spoken keywords from streaming audio. This is achieved by comparing the embeddings from voice enrollments and input audio. State-of-the-art custom KWS models are typically trained contrastively using utterances whose keywords are randomly sampled from the training dataset. These KWS models often struggle with confusable keywords, such as "blue" versus "glue". This paper introduces an effective way to augment the training data with confusable utterances, where keywords are generated and grouped by large language models (LLMs) and speech signals are synthesized with diverse speaking styles by text-to-speech (TTS) engines. To better measure user experience on confusable KWS, we define a new north-star metric: the average area under the DET curve across confusable groups (c-AUC). Featuring high scalability and zero labor cost, the proposed…
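
To make the two mechanisms in the abstract concrete, here is a minimal Python sketch of (a) scoring input audio against enrollments by embedding similarity and (b) averaging the area under the DET curve over confusable groups. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine_score`, `det_auc`, `c_auc`), the use of cosine similarity against an enrollment centroid, the linear-axis trapezoidal integration, and the per-group score layout are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def cosine_score(enroll_embs, query_emb):
    """Cosine similarity between a query embedding and the enrollment centroid.

    Assumption: enrollments are L2-normalised and averaged into one centroid.
    """
    enroll = np.asarray(enroll_embs, dtype=float)
    enroll = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    centroid = enroll.mean(axis=0)
    query = np.asarray(query_emb, dtype=float)
    denom = np.linalg.norm(centroid) * np.linalg.norm(query)
    return float(centroid @ query / denom)


def det_auc(target_scores, nontarget_scores):
    """Area under the DET curve (false reject rate vs. false accept rate).

    Assumption: linear axes and trapezoidal integration. Under this choice
    0.0 means perfect separation and 1.0 means fully inverted scores.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Sweep every observed score as a threshold, plus sentinels at both ends
    # so the curve spans the full false-accept range.
    thresholds = np.concatenate(
        [[-np.inf], np.unique(np.concatenate([target, nontarget])), [np.inf]]
    )
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    frr = np.array([(target < t).mean() for t in thresholds])
    # Reverse so the false accept rate is ascending before integrating.
    far, frr = far[::-1], frr[::-1]
    return float(np.sum(np.diff(far) * (frr[1:] + frr[:-1]) / 2.0))


def c_auc(groups):
    """Mean DET area over confusable groups.

    `groups` maps a group name to (target_scores, nontarget_scores), where
    non-target scores come from the other keywords in the same group.
    """
    return float(np.mean([det_auc(t, n) for t, n in groups.values()]))
```

In use, each confusable group (for example one containing "blue" and "glue") would contribute one DET area computed from scores such as those returned by `cosine_score`, and `c_auc` would report their mean.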

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders