Understanding Audio Features via Trainable Basis Functions
Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans

TL;DR
This paper investigates making spectrogram basis functions trainable so that the input features adapt to the task. Doing so raises keyword spotting accuracy by 14.2 percentage points and lowers the phone error rate in speech recognition by 9.5 percentage points, with the gains most pronounced in models with limited complexity.
Contribution
The paper introduces trainable basis functions for spectrograms, so that a model can adapt its input features to a specific task. This improves performance in keyword spotting (KWS) and automatic speech recognition (ASR).
Findings
Boosted KWS accuracy by 14.2 percentage points
Lowered Phone Error Rate by 9.5 percentage points
Provided insights into important frequency bins for tasks
Abstract
In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments. Input features, however, are often treated as fixed. In the case of audio, signals can be expressed in two main ways: raw waveforms (time domain) or spectrograms (time-frequency domain). In addition, different spectrogram types are often used and tailored to fit different applications. In our experiments, we allow for this tailoring directly as part of the network. Our experimental results show that using trainable basis functions can boost the accuracy of KWS by 14.2 percentage…
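To make the idea of a trainable spectrogram basis concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the Fourier initialization, the Hann window, the frame and hop sizes, the toy loss, and all function names (`fourier_basis`, `frame_signal`, `power_spectrogram`, `basis_gradients`) are assumptions made for illustration. The point is only that the cosine and sine kernels are ordinary weight matrices, so a loss gradient can flow back into them like any other layer.

```python
import numpy as np


def fourier_basis(n_fft, n_bins):
    """Hann-windowed cosine and sine kernels, each of shape (n_bins, n_fft).

    Assumption: the trainable basis starts from the standard STFT kernels.
    """
    n = np.arange(n_fft)
    k = np.arange(n_bins)[:, None]
    angle = 2.0 * np.pi * k * n / n_fft
    window = np.hanning(n_fft)
    return np.cos(angle) * window, np.sin(angle) * window


def frame_signal(x, n_fft, hop):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, n_fft)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def power_spectrogram(frames, w_cos, w_sin):
    """Project frames onto the kernels; returns shape (n_frames, n_bins)."""
    real = frames @ w_cos.T
    imag = frames @ w_sin.T
    return real ** 2 + imag ** 2


def basis_gradients(frames, w_cos, w_sin, grad_spec):
    """Backpropagate dLoss/dSpectrogram to the two kernel matrices."""
    real = frames @ w_cos.T
    imag = frames @ w_sin.T
    g_cos = (2.0 * real * grad_spec).T @ frames
    g_sin = (2.0 * imag * grad_spec).T @ frames
    return g_cos, g_sin


# Toy usage: one gradient step on the kernels (illustrative values only).
rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
n_fft, hop = 64, 32
w_cos, w_sin = fourier_basis(n_fft, n_fft // 2 + 1)
frames = frame_signal(signal, n_fft, hop)
spec = power_spectrogram(frames, w_cos, w_sin)

# Toy loss: sum of all spectrogram entries, so dLoss/dSpec is all ones.
g_cos, g_sin = basis_gradients(frames, w_cos, w_sin, np.ones_like(spec))
learning_rate = 1e-4  # hypothetical value
w_cos_updated = w_cos - learning_rate * g_cos
w_sin_updated = w_sin - learning_rate * g_sin
```

Before any update, the kernels reproduce the squared magnitude of a windowed FFT. After training, each row of `w_cos` and `w_sin` can drift away from a pure sinusoid, which is what lets the front end be tailored to a task. In practice an autodiff framework would compute these gradients; they are written out by hand here only to keep the sketch dependency-light.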
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
