Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Hyun Jin Park; Dhruuv Agarwal; Neng Chen; Rentao Sun; Kurt Partridge; Justin Chen; Harry Zhang; Pai Zhu; Jacob Bartel; Kyle Kastner; Gary Wang; Andrew Rosenberg; Quan Wang

arXiv:2408.10463 · cs.SD · February 6, 2026

Open Access

TL;DR

This paper introduces an adversarial training approach that keeps keyword spotting (KWS) models from overfitting to text-to-speech (TTS) artifacts. It improves accuracy on real speech by up to 12%, and by up to 8% even when no positive real examples are available for training.

Contribution

The paper proposes an adversarial training method that reduces overfitting to TTS artifacts in KWS models and thereby improves detection accuracy on real speech.

Findings

01. Adding the adversarial loss to the original KWS loss improves accuracy on real speech by up to 12%.

02. Adversarial training improves accuracy by up to 8% even without positive real examples.

03. The method reduces overfitting to TTS artifacts, which makes the model more robust on real speech.

Abstract

The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and…
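The abstract describes adding an adversarial loss to the original KWS loss but does not spell out the formulation here. As a rough sketch only, one common way to set up such an objective is domain-adversarial training: a discriminator tries to tell TTS from real audio, and the encoder is rewarded when the discriminator fails. Everything below is an assumption for illustration, not the paper's confirmed method: the function names `bce` and `adversarial_objectives`, the weight `lam`, and the convention that domain label 1 means TTS and 0 means real speech.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def adversarial_objectives(kws_probs, kws_labels, dom_probs, dom_labels, lam=0.1):
    """Return (encoder_objective, discriminator_objective), both to be minimized.

    Hypothetical setup (not taken from the paper):
      kws_probs / kws_labels: keyword predictions and targets (1 = keyword present).
      dom_probs / dom_labels: discriminator predictions and targets (1 = TTS, 0 = real).
      lam: assumed weight of the adversarial term.
    """
    n = len(kws_labels)
    kws_loss = sum(bce(p, y) for p, y in zip(kws_probs, kws_labels)) / n
    dom_loss = sum(bce(p, y) for p, y in zip(dom_probs, dom_labels)) / n
    # The discriminator minimizes dom_loss. The encoder minimizes the KWS loss
    # while *maximizing* dom_loss, which discourages TTS-specific features.
    encoder_objective = kws_loss - lam * dom_loss
    return encoder_objective, dom_loss


if __name__ == "__main__":
    enc, disc = adversarial_objectives(
        kws_probs=[0.9, 0.1], kws_labels=[1, 0],
        dom_probs=[0.5, 0.5], dom_labels=[1, 0],
    )
    print(f"encoder objective: {enc:.4f}, discriminator loss: {disc:.4f}")
```

In this sketch the encoder objective is lower when the discriminator is confused (predictions near 0.5) than when it separates TTS from real audio confidently, which is the pressure that would push the encoder away from TTS-specific cues. In a real training loop the same effect is often obtained with a gradient reversal layer rather than two explicit objectives.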

Taxonomy

Topics: Text and Document Classification Technologies · Spam and Phishing Detection · Handwritten Text Recognition Techniques