Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs

Domenico De Cristofaro; Vincenzo Norman Vitale; Alessandro Vietti

arXiv:2508.17914 · cs.CL · August 26, 2025

TL;DR

This paper investigates how well the CNN layers of Wav2Vec capture vowel information. It compares CNN activations with MFCCs and with MFCCs plus formants by training SVM classifiers on the TIMIT corpus.

Contribution

The paper provides a layer-wise analysis of the features produced by Wav2Vec's CNN feature extractor, showing how well they encode phonetic information about vowels compared with traditional features (MFCCs and formants).

Findings

01 CNN features contain significant vowel information.

02 MFCCs with formants outperform CNN features in classification.

03 Layer-wise differences reveal how phonetic information is encoded.

Abstract

Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
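
To make the layer-wise framing concrete, the sketch below computes how much audio each CNN layer "sees" per output frame. It is an illustration under stated assumptions, not the authors' code: the abstract does not say which Wav2Vec variant is analysed, so the kernel and stride values are assumed to be those of the wav2vec 2.0 feature encoder, and the function name `layer_geometry` is hypothetical.

```python
# Assumption: (kernel, stride) pairs of the wav2vec 2.0 feature encoder.
# The source does not state which Wav2Vec variant was analysed.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # TIMIT recordings are sampled at 16 kHz


def layer_geometry(num_samples, layers=CONV_LAYERS):
    """Return one (frames, receptive_field, hop) tuple per conv layer.

    frames          : number of output frames for an input of num_samples
    receptive_field : input samples that influence a single output frame
    hop             : input samples between consecutive output frames
    All convolutions are treated as unpadded 1-D convolutions.
    """
    rows = []
    frames, receptive_field, hop = num_samples, 1, 1
    for kernel, stride in layers:
        frames = (frames - kernel) // stride + 1
        receptive_field += (kernel - 1) * hop
        hop *= stride
        rows.append((frames, receptive_field, hop))
    return rows


if __name__ == "__main__":
    # Geometry for one second of 16 kHz audio, reported in milliseconds.
    for index, (frames, rf, hop) in enumerate(layer_geometry(SAMPLE_RATE), start=1):
        rf_ms = 1000 * rf / SAMPLE_RATE
        hop_ms = 1000 * hop / SAMPLE_RATE
        print(f"layer {index}: {frames} frames, field {rf_ms:.2f} ms, hop {hop_ms:.2f} ms")
```

Under these assumptions, early layers cover well under a millisecond of signal per frame, while the last layer covers 25 ms with a 20 ms hop, which is close to a typical MFCC analysis window. This is one reason a comparison between layer activations and MFCCs has to account for differing time resolutions across layers.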
