Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs
Domenico De Cristofaro, Vincenzo Norman Vitale, Alessandro Vietti

TL;DR
This paper investigates how well Wav2Vec's CNN layers capture vowel information by comparing their activations with MFCCs and with MFCCs combined with formants, using SVM classifiers trained on monophthong vowels from the TIMIT corpus.
Contribution
It provides a layer-wise analysis of Wav2Vec's CNN features for vowel representation, highlighting their phonetic encoding capabilities compared to traditional features.
Findings
CNN features contain significant vowel information.
MFCCs with formants outperform CNN features in classification.
Layer-wise differences reveal how phonetic information is encoded.
Abstract
Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
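The comparison described in the abstract (one SVM per feature set, scored by classification accuracy) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the RBF kernel, the standardization step, the 5-fold cross-validation, and the helper names `probe_accuracy` and `compare_feature_sets` are all assumptions, and the data is synthetic rather than taken from TIMIT. In a real run, each entry of `feature_sets` would hold per-vowel vectors for MFCCs, MFCCs with formants, or the activations of one CNN layer.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def probe_accuracy(features, labels, folds=5):
    """Mean cross-validated accuracy of an SVM trained on one feature set."""
    # Kernel, C, and scaling are illustrative choices, not taken from the paper.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return float(cross_val_score(clf, features, labels, cv=folds).mean())


def compare_feature_sets(feature_sets, labels, folds=5):
    """Score every named feature matrix against the same front/back labels."""
    return {name: probe_accuracy(X, labels, folds) for name, X in feature_sets.items()}


# Synthetic stand-in data (NOT TIMIT): 200 "vowel tokens", 13-dimensional vectors.
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)  # hypothetical coding: 0 = back, 1 = front

# One feature set that separates the classes and one that carries no class signal.
informative = rng.normal(size=(n, 13)) + 3.0 * labels[:, None]
uninformative = rng.normal(size=(n, 13))

scores = compare_feature_sets(
    {"informative": informative, "uninformative": uninformative}, labels
)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

Running the same probe on every feature set with identical labels and folds keeps the accuracies comparable, which is what lets a layer-wise ranking say something about where vowel information sits.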
