TL;DR
This paper demonstrates that combining triplet loss-based embeddings with a kNN classifier significantly improves speech keyword spotting accuracy, achieving state-of-the-art results on multiple datasets.
Contribution
It introduces a novel phonetic-similarity-based triplet mining method and shows that triplet loss-based embeddings combined with a kNN variant outperform cross-entropy classification for convolutional networks on spoken-word classification.
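The summary does not describe how the phonetic similarity mining works, so the following is only a minimal sketch of one way such mining could look, not the authors' method. It assumes each word has a known phoneme sequence, measures phonetic similarity with Levenshtein distance, and picks the negative from the phonetically closest other word. The toy `LEXICON`, `edit_distance`, and `mine_triplet` are all hypothetical names introduced for illustration.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical toy lexicon (word -> phoneme sequence), for illustration only.
LEXICON = {
    "cat": ("K", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "bat": ("B", "AE", "T"),
    "dog": ("D", "AO", "G"),
}

def mine_triplet(anchor_word, samples_by_word, lexicon, rng):
    """Return (anchor, positive, negative) sample ids.

    The positive is another sample of the same word. The negative is a
    sample of the phonetically closest different word (an assumed
    hard-negative rule; ties are broken alphabetically).
    """
    anchor, positive = rng.sample(samples_by_word[anchor_word], 2)
    others = [w for w in samples_by_word if w != anchor_word]
    closest = min(
        others,
        key=lambda w: (edit_distance(lexicon[anchor_word], lexicon[w]), w),
    )
    negative = rng.choice(samples_by_word[closest])
    return anchor, positive, negative

# Usage: two recordings per word, identified by made-up sample ids.
samples = {w: [f"{w}_0", f"{w}_1"] for w in LEXICON}
triplet = mine_triplet("cat", samples, LEXICON, random.Random(0))
```

Choosing negatives that sound similar to the anchor forces the embedding to separate confusable words, which is the usual motivation for hard-negative mining.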
Findings
26% to 38% improvement in classification accuracy over cross-entropy training on the LibriSpeech-derived LibriWords datasets
Achieved 98.55% accuracy on Google Speech Commands V1 10+2-class
Achieved 97.0% accuracy on Google Speech Commands V2 35-class
Abstract
In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. On the other hand, in the area of speech recognition, the metric embeddings generated by the triplet loss are rarely used, even for classification problems. We fill this gap by showing that a combination of two representation learning techniques, a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy for convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic-similarity-based triplet mining approach. We also improve the current best published SOTA for Google Speech Commands dataset V1 10+2-class classification by about 34%, achieving 98.55% accuracy, V2…
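The two techniques named in the abstract can be illustrated with a minimal sketch. It assumes squared Euclidean distance, a hypothetical margin of 0.2, and plain majority-vote kNN as a stand-in for the paper's kNN variant, which this excerpt does not specify. The embeddings are hand-written 2-D points rather than outputs of a trained convolutional network.

```python
from collections import Counter

def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss for one triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed value).
    """
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

def knn_predict(train_emb, train_labels, query, k=3):
    """Majority vote over the k training embeddings nearest to `query`."""
    ranked = sorted(range(len(train_emb)),
                    key=lambda i: sq_dist(query, train_emb[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy embeddings: two well-separated clusters standing in for two keywords.
train_emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
             (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_labels = ["yes", "yes", "yes", "no", "no", "no"]

prediction = knn_predict(train_emb, train_labels, (0.05, 0.05))
```

During training, the triplet loss shapes the embedding space so that samples of the same word cluster together. At inference time, the classifier needs no softmax layer and labels a query by its nearest neighbours in that space.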
