GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Harry Zhang; Kurt Partridge; Pai Zhu; Neng Chen; Hyun Jin Park; Dhruuv Agarwal; Quan Wang

arXiv:2505.14814 · cs.SD · February 6, 2026

TL;DR

GraphemeAug systematically generates adversarial hard negative examples for keyword spotting by applying insertion, deletion, and substitution edits to the keyword's graphemes. Training with these examples improves AUC on synthetic hard negatives by 61% while maintaining quality on positive and ambient negative data.

Contribution

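The paper introduces a method for synthesizing hard negatives for KWS by applying insertion, deletion, and substitution edits to the keyword's graphemes. This supplies training examples close to the keyword/non-keyword decision boundary, which are otherwise scarce in training data.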

Findings

01. AUC improved by 61% on synthetic hard negatives
02. Maintains quality on positive and ambient negative data
03. Effective boundary-focused data augmentation for KWS

Abstract

Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.
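The grapheme edits described in the abstract can be illustrated with a minimal sketch. It enumerates every string that is exactly one insertion, deletion, or substitution away from a keyword. The function name `grapheme_edits`, the lowercase ASCII alphabet, and the example keyword "cat" are assumptions made for illustration and are not taken from the paper. The sketch covers only the text side; how the edited strings are selected and turned into audio is not specified here.

```python
import string


def grapheme_edits(keyword, alphabet=string.ascii_lowercase):
    """Return all strings exactly one grapheme edit away from `keyword`.

    Hypothetical helper: treats each character as one grapheme and uses
    lowercase ASCII letters as the edit alphabet (both are assumptions).
    """
    variants = set()
    n = len(keyword)

    # Insertions: place each letter at every position, including both ends.
    for i in range(n + 1):
        for ch in alphabet:
            variants.add(keyword[:i] + ch + keyword[i:])

    for i in range(n):
        # Deletion: drop the grapheme at position i.
        variants.add(keyword[:i] + keyword[i + 1:])
        # Substitutions: replace the grapheme at position i with a different letter.
        for ch in alphabet:
            if ch != keyword[i]:
                variants.add(keyword[:i] + ch + keyword[i + 1:])

    # The keyword itself is a positive, never a hard negative.
    variants.discard(keyword)
    return sorted(variants)


if __name__ == "__main__":
    candidates = grapheme_edits("cat")
    print(len(candidates), candidates[:5])
```

Using a set removes duplicates that arise when different edits yield the same string, for example inserting "c" before or after the first letter of "cat". Some candidates may still sound identical to the keyword when spoken, so a real pipeline would likely need a filtering step before treating them as negatives.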

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Software Engineering Research