TL;DR
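This paper introduces UniKW-AT, a unified framework that jointly trains keyword spotting (KWS) and audio tagging (AT). Joint training improves the noise robustness of KWS while keeping the ability to predict sound events. The MobileNetV2 model reaches 97.53% accuracy on Google Speech Commands V1 (GSCV1) and an mAP of 33.4 on Audioset (AS).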
Contribution
It presents an initial approach for jointly training KWS and AT. The AT pipeline is extended with additional labels describing the presence of a keyword, which improves noise robustness for KWS and enables conditional wake-ups on sound events.
Findings
Achieves 97.53% accuracy on the GSCV1 dataset and an mAP of 33.4 on AS with a MobileNetV2 model.
Shows significant noise robustness gains in real-world KWS.
Merges KWS and AT without performance loss.
Abstract
Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training both KWS and AT. UniKW-AT enhances the noise-robustness for KWS, while also being able to predict specific sound events and enabling conditional wake-ups on sound events. Our approach extends the AT pipeline with additional labels describing the presence of a keyword. Experiments are conducted on the Google Speech Commands V1 (GSCV1) and the balanced Audioset (AS) datasets. The proposed MobileNetV2 model achieves an accuracy of 97.53% on the GSCV1 dataset and an mAP of 33.4 on the AS evaluation…
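The label extension described in the abstract can be illustrated with a minimal sketch. It assumes that keyword labels are simply appended to the AT label set and that every clip is encoded as a multi-hot vector over the combined set. The label names, the ordering, and the helper `encode_targets` are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
# Hypothetical label inventories (assumption: the real label sets are much larger).
AT_LABELS = ["Speech", "Dog", "Music"]   # sound-event labels from the AT task
KW_LABELS = ["yes", "no", "stop"]        # keyword labels from the KWS task

# Unified label space: AT labels first, keyword labels appended after them.
UNIFIED_LABELS = AT_LABELS + KW_LABELS
LABEL_TO_INDEX = {name: i for i, name in enumerate(UNIFIED_LABELS)}


def encode_targets(active_labels):
    """Return a multi-hot target vector over the unified label space."""
    target = [0.0] * len(UNIFIED_LABELS)
    for name in active_labels:
        target[LABEL_TO_INDEX[name]] = 1.0
    return target


# An AT clip may carry several sound events; a KWS clip carries one keyword.
at_clip = encode_targets(["Dog", "Music"])
kws_clip = encode_targets(["yes"])
```

Under this assumption, clips from both datasets share one target format, so a single model with one output layer can be trained on a mixture of the two.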
