Understanding Audio Features via Trainable Basis Functions

Kwan Yee Heung; Kin Wai Cheuk; Dorien Herremans

arXiv:2204.11437·cs.SD·April 26, 2022

TL;DR

This paper makes spectrogram basis functions trainable so that the audio features can adapt to the task. It reports a 14.2 percentage point gain in keyword spotting accuracy and a 9.5 percentage point reduction in phone error rate, with improvements most pronounced in models with limited complexity.

Contribution

The paper introduces trainable basis functions for spectrograms, allowing models to adapt their input features to specific tasks, which improves performance in keyword spotting (KWS) and automatic speech recognition (ASR).

Findings

01  Boosted KWS accuracy by 14.2 percentage points

02  Lowered Phone Error Rate by 9.5 percentage points

03  Provided insights into important frequency bins for tasks

Abstract

In this paper, we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments. Input features, however, are often treated as fixed. In the case of audio, signals are mainly expressed in one of two ways: raw waveforms (time domain) or spectrograms (time-frequency domain). In addition, different spectrogram types are often used and tailored to fit different applications. In our experiments, we allow for this tailoring directly as part of the network. Our experimental results show that using trainable basis functions can boost the accuracy of KWS by 14.2 percentage…
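To make the idea of a trainable basis function concrete, the following is a minimal sketch in plain Python (standard library only). It is not the authors' implementation, which the listing below describes as a PyTorch repository. It assumes a single frequency bin whose cosine and sine kernels start as ordinary Fourier kernels and are then treated as learnable weights. The toy frame, the `target` value, the squared-error loss, and the learning rate are all hypothetical choices made for illustration.

```python
import math


def fourier_basis(n_fft, k):
    """Return the (cos, sin) kernels for bin k of an n_fft-point DFT."""
    cos_k = [math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
    sin_k = [math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
    return cos_k, sin_k


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def bin_power(frame, cos_k, sin_k):
    """Squared magnitude of one spectrogram bin: (cos.x)^2 + (sin.x)^2."""
    return dot(cos_k, frame) ** 2 + dot(sin_k, frame) ** 2


def gradient_step(frame, cos_k, sin_k, target, lr):
    """One gradient-descent step on the loss (bin_power - target)^2.

    The kernels are the trainable parameters; the frame is fixed input.
    """
    re, im = dot(cos_k, frame), dot(sin_k, frame)
    err = re * re + im * im - target
    # d(loss)/d(cos_k[n]) = 2 * err * 2 * re * frame[n]; same for sin with im
    new_cos = [w - lr * 4 * err * re * x for w, x in zip(cos_k, frame)]
    new_sin = [w - lr * 4 * err * im * x for w, x in zip(sin_k, frame)]
    return new_cos, new_sin


N_FFT = 16
# Toy input frame: a sinusoid that falls exactly on bin 3.
frame = [math.sin(2 * math.pi * 3 * n / N_FFT) for n in range(N_FFT)]

# Initialize the basis as a fixed Fourier kernel, as a standard STFT would.
cos3, sin3 = fourier_basis(N_FFT, 3)
power_before = bin_power(frame, cos3, sin3)

# Hypothetical training signal: pretend the task prefers half this power.
target = 0.5 * power_before
loss_before = (power_before - target) ** 2

# Unlike a fixed STFT, the kernels themselves are updated by the gradient.
cos3_new, sin3_new = gradient_step(frame, cos3, sin3, target, lr=1e-4)
loss_after = (bin_power(frame, cos3_new, sin3_new) - target) ** 2
```

In a full model the same principle would apply to every frequency bin at once, with the gradient coming from the downstream task loss (for example, KWS classification) rather than from a hand-picked target.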

Code & Models

Repositories

heungky/trainable-stft-mel (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing