Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

Ludwig Kürzinger; Nicolas Lindae; Palle Klewitz; Gerhard Rigoll

arXiv:2010.07597 · eess.AS · October 19, 2020

TL;DR

This paper introduces Lightweight Sinc-Convolutions (LSC), a learnable feature extractor that operates on raw audio inside an end-to-end speech recognition model. The resulting model reaches 10.7% WER on TEDlium v2 at 21% of the size of the comparable architecture with traditional features.

Contribution

The paper proposes a low-parameter feature extractor for end-to-end ASR that combines Sinc-convolutions with depthwise convolutions, improving both efficiency and accuracy over traditional handcrafted features.

Findings

01

Achieves 10.7% WER on the TEDlium v2 test set, surpassing the corresponding architecture with log-mel filterbank features.

02

Model size is only 21% of the comparable architecture with traditional features.

03

Training converges smoothly, and convergence improves further when SpecAugment is applied in the time domain.

Abstract

Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate the human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. For this, we propose Lightweight Sinc-Convolutions (LSC) that integrate Sinc-convolutions with depthwise convolutions as a low-parameter machine-learnable feature extraction for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in time-domain. We also discuss filter-level improvements, such as using log-compression as activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel…
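
To make the idea of a Sinc-convolution concrete, here is a minimal NumPy sketch of a windowed sinc band-pass filter bank followed by log-compression. It is an illustration only, not the authors' implementation. The function names, kernel size, sample rate, Hamming window, and the log(|x| + 1) form of the compression are all assumptions chosen for the example, and the depthwise convolution layers mentioned in the abstract are not shown. The point it illustrates is that each filter is fully determined by two cutoff frequencies, which is what keeps the parameter count low.

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, kernel_size=101, sample_rate=16000):
    """Band-pass FIR kernel defined only by its two cutoff frequencies.

    All default values here are illustrative assumptions.
    """
    # Cutoffs normalized to cycles per sample.
    f1 = f1_hz / sample_rate
    f2 = f2_hz / sample_rate
    # Sample positions centered on zero so the kernel is symmetric.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Difference of two low-pass sinc filters gives a band-pass filter.
    # np.sinc(x) is the normalized sinc, sin(pi * x) / (pi * x).
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A window reduces ripple caused by truncating the ideal filter.
    return h * np.hamming(kernel_size)

def log_compress(x):
    """Log-compression used as an activation (assumed form: log(|x| + 1))."""
    return np.log(np.abs(x) + 1.0)

def sinc_features(signal, bands, kernel_size=101, sample_rate=16000):
    """Filter raw audio with one sinc kernel per band, then log-compress."""
    rows = []
    for f1_hz, f2_hz in bands:
        kernel = sinc_bandpass(f1_hz, f2_hz, kernel_size, sample_rate)
        filtered = np.convolve(signal, kernel, mode="valid")
        rows.append(log_compress(filtered))
    return np.stack(rows)

# Usage: a 1 kHz tone should excite the 500-1500 Hz band far more
# than the 3000-4000 Hz band.
t = np.arange(1600) / 16000
tone = np.sin(2 * np.pi * 1000 * t)
feats = sinc_features(tone, [(500, 1500), (3000, 4000)])
print(feats.shape)  # (2, 1500)
```

In a trainable model the two cutoffs per filter would be learned by gradient descent instead of being fixed, so a bank of K filters carries only 2K parameters regardless of kernel length.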
