Comparative Analysis of the wav2vec 2.0 Feature Extractor
Peter Vieting, Ralf Schlüter, Hermann Ney

TL;DR
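This paper studies whether wav2vec 2.0's convolutional feature extractor can replace traditional feature extraction in a CTC speech recognition model. It shows that this extractor and an alternative neural feature extractor are both competitive with traditional methods on LibriSpeech, and it highlights the importance of bandpass filters among the learned filters.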
Contribution
It provides an extensive analysis of wav2vec 2.0's feature extractor, comparing it to traditional methods and examining the learned filters' significance in ASR performance.
Findings
Both neural FEs (the wav2vec 2.0 convolutional FE and an alternative neural FE) are competitive with traditional FEs on LibriSpeech.
The most important information is captured by a set of bandpass filters.
Analysis of learned filters reveals key features for ASR.
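One generic way to check whether a learned 1-D filter behaves like a bandpass filter is to look at its magnitude frequency response. The sketch below is an illustration only and is not the analysis procedure from the paper: the function names, the 16 kHz sample rate, and the Hann-windowed cosine used as a stand-in for a learned kernel are all assumptions made for the example.

```python
import cmath
import math


def magnitude_response(kernel, n_bins=257):
    """Return |H(f)| at n_bins frequencies spaced evenly from 0 to Nyquist."""
    mags = []
    for b in range(n_bins):
        omega = math.pi * b / (n_bins - 1)  # radians per sample
        h = sum(w * cmath.exp(-1j * omega * n) for n, w in enumerate(kernel))
        mags.append(abs(h))
    return mags


def peak_frequency_hz(kernel, sample_rate=16000, n_bins=257):
    """Frequency (Hz) at which the filter's magnitude response is largest."""
    mags = magnitude_response(kernel, n_bins)
    peak_bin = max(range(n_bins), key=mags.__getitem__)
    return peak_bin * (sample_rate / 2) / (n_bins - 1)


def windowed_cosine(center_hz, length=128, sample_rate=16000):
    """Hypothetical stand-in for a learned kernel: a Hann-windowed cosine."""
    return [
        (0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)))
        * math.cos(2 * math.pi * center_hz * n / sample_rate)
        for n in range(length)
    ]


if __name__ == "__main__":
    kernel = windowed_cosine(2000.0)
    print(peak_frequency_hz(kernel))  # close to 2000 Hz for this toy kernel
```

A filter whose response peaks at an interior frequency and falls off toward both 0 Hz and the Nyquist frequency is bandpass-like. For a trained model, the toy kernel would be replaced by the weights of one first-layer convolution channel.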
Abstract
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model, which has recently gained large popularity, uses a convolutional FE which operates directly on the speech waveform. However, it is not yet studied extensively in the literature. In this work, we study its capability to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it to an alternative neural FE. We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components. Furthermore, we analyze the learned filters and show that the most important…
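To make "a convolutional FE which operates directly on the speech waveform" concrete, the following sketch works out the frame rate and receptive field of such a stack of strided 1-D convolutions. The kernel widths and strides are the ones reported in the original wav2vec 2.0 publication, not values stated on this page; the helper names and the 16 kHz sample rate are assumptions for illustration, and padding is assumed to be zero.

```python
# Kernel widths and strides of the seven conv layers, as reported in the
# original wav2vec 2.0 publication (assumed here, not stated on this page).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def receptive_field_and_stride(kernels, strides):
    """Return (receptive field, total stride) in input samples."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= s             # spacing between adjacent outputs, in samples
    return rf, jump


def num_frames(num_samples, kernels, strides):
    """Number of output frames for an unpadded input of num_samples."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n


if __name__ == "__main__":
    print(receptive_field_and_stride(KERNELS, STRIDES))  # (400, 320)
    print(num_frames(16000, KERNELS, STRIDES))           # 49
```

At 16 kHz this corresponds to one feature frame roughly every 20 ms, each covering 25 ms of audio, which is comparable to the window and shift commonly used in traditional feature extraction pipelines.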
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
