SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech

Lu Gan; Xi Li

arXiv:2511.07821·cs.SD·November 25, 2025

TL;DR

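This paper presents SynTTS-Commands, a multilingual (English and Chinese) synthetic speech dataset generated with TTS for training on-device keyword spotting (KWS) models. Models trained on it reach up to 99.5% accuracy on English commands and 98% on Chinese commands, which addresses the scarcity of training data for low-power voice interfaces.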

Contribution

Introduces a scalable, synthetic multilingual dataset for on-device KWS, validated by high recognition accuracy, reducing reliance on costly human recordings.

Findings

01. Achieved up to 99.5% accuracy on English commands

02. Achieved up to 98% accuracy on Chinese commands

03. Validated synthetic data as effective for training KWS models

Abstract

The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks scalability. This paper introduces SYNTTS-COMMANDS, a novel, multilingual voice command dataset entirely generated using state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5% on English and 98% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded…
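The abstract describes scaling the dataset by combining command texts with speaker embeddings. A minimal sketch of that pairing step is given here, assuming each command is synthesized once per same-language speaker. The command lists, speaker IDs, directory layout, and the names `SynthesisJob` and `build_manifest` are all hypothetical and not taken from the paper; the actual CosyVoice 2 synthesis call is deliberately left out.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import Path


@dataclass(frozen=True)
class SynthesisJob:
    """One utterance to synthesize: a command spoken by one speaker."""
    language: str
    command: str
    speaker_id: str
    output_path: Path


def build_manifest(commands, speakers, out_dir="synth"):
    """Pair every command with every speaker of the same language.

    commands: dict mapping language code -> list of command strings
    speakers: dict mapping language code -> list of speaker IDs
    """
    jobs = []
    for language, texts in commands.items():
        pairs = product(enumerate(texts), speakers.get(language, []))
        for (cmd_idx, text), speaker_id in pairs:
            path = Path(out_dir) / language / f"cmd{cmd_idx:03d}" / f"{speaker_id}.wav"
            jobs.append(SynthesisJob(language, text, speaker_id, path))
    return jobs


# Hypothetical inputs, chosen only to illustrate the pairing.
COMMANDS = {"en": ["play", "pause", "volume up"], "zh": ["播放", "暂停"]}
SPEAKERS = {"en": ["spk_en_0001", "spk_en_0002"], "zh": ["spk_zh_0001"]}

manifest = build_manifest(COMMANDS, SPEAKERS)
for job in manifest[:3]:
    print(job.language, job.command, job.speaker_id, job.output_path)
```

The number of utterances grows as commands times speakers per language, so adding speaker embeddings enlarges the dataset without any new recordings. Each job would then be handed to a TTS system to produce the audio file at `output_path`.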

Code & Models

Models

lugan/SynTTS-Commands-Media-Benchmarks
model · 104 downloads · 1 like

Datasets

lugan/Syntts-Commands-Media-Dataset
dataset · 34 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques