Regularizing Learnable Feature Extraction for Automatic Speech Recognition
Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

TL;DR
This paper explores regularization techniques for learnable feature extractors in speech recognition, demonstrating that proper regularization significantly improves their performance and narrows the gap with traditional methods.
Contribution
It examines audio perturbation and proposes STFT-domain masking as regularization methods for learnable front-ends in ASR, addressing their increased susceptibility to overfitting.
Findings
Regularization improves learnable feature extraction performance.
STFT-domain masking outperforms standard SpecAugment.
Performance gap with traditional features is effectively closed.
Abstract
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short-time Fourier transform (STFT) domain as a simple but effective modification to address these challenges. Finally, integrating both regularization…
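To make the idea of masking in the STFT domain concrete, here is a minimal sketch. It assumes SpecAugment-style time and frequency masks applied to an STFT matrix that has already been computed. The function name `mask_stft` and all mask counts and widths are hypothetical, illustrative values that are not taken from the paper. How the masked STFT is then passed on to the learnable front-end is not described in this summary and is left out.

```python
import random


def mask_stft(stft, num_time_masks=2, max_time_width=3,
              num_freq_masks=2, max_freq_width=2, rng=None):
    """Zero random time and frequency stripes of an STFT matrix.

    stft: list of frames, each a list of complex bins (frames x bins).
    All default mask counts and widths are illustrative assumptions.
    Returns a masked copy; the input is left unchanged.
    """
    rng = rng or random.Random()
    num_frames = len(stft)
    num_bins = len(stft[0]) if num_frames else 0
    masked = [list(frame) for frame in stft]

    # Time masks: zero every bin in a random contiguous run of frames.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [0j] * num_bins

    # Frequency masks: zero a random contiguous run of bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_bins))
        start = rng.randint(0, num_bins - width)
        for frame in masked:
            for f in range(start, start + width):
                frame[f] = 0j

    return masked


if __name__ == "__main__":
    # Toy 10-frame, 8-bin "STFT" with distinct nonzero complex values.
    toy = [[complex(t + 1, f + 1) for f in range(8)] for t in range(10)]
    out = mask_stft(toy, rng=random.Random(0))
    zeroed = sum(1 for frame in out for value in frame if value == 0)
    print(f"zeroed {zeroed} of {10 * 8} bins")
```

Passing a seeded `random.Random` instance makes the masks reproducible, which is convenient when comparing training runs.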
