Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

Linus Stuhlmann; Michael Alexander Saxer

arXiv:2509.00230 · cs.SD · September 30, 2025

Open Access

TL;DR

This paper compares Wav2Vec 2.0, XLS-R, and Whisper speech models, analyzing their layer-wise features and optimal configurations for speaker identification, revealing how each model captures speaker-specific information at different depths.

Contribution

It provides a detailed layer-wise analysis of three speech models for speaker identification and identifies optimal transformer layer configurations for each.

Findings

01

Wav2Vec 2.0 and XLS-R capture speaker features in early layers.

02

Fine-tuning improves the stability and performance of Wav2Vec 2.0 and XLS-R.

03

Whisper captures speaker-specific information more effectively in its deeper layers.

Abstract

This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving stability and performance. Whisper showed better performance in deeper layers. Additionally, we determined the optimal number of transformer layers for each model when fine-tuned for speaker identification tasks.
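
To make the layer-wise comparison more concrete, the following is a minimal sketch of SVCCA (SVD followed by canonical correlation analysis) between two sets of activations. It is not the authors' code: the function names `svd_reduce` and `svcca`, the 99% variance threshold, and the synthetic arrays standing in for layer activations are all assumptions made for illustration. In a real analysis, the inputs would be per-frame or pooled hidden states taken from two transformer layers of the same model.

```python
import numpy as np


def svd_reduce(acts, var_kept=0.99):
    """Center activations (samples x features) and keep the top singular
    directions that together explain `var_kept` of the variance."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, var_kept)) + 1, len(s))
    return u[:, :k] * s[:k]


def svcca(acts_a, acts_b, var_kept=0.99):
    """Mean canonical correlation between two SVD-reduced activation sets."""
    qa, _ = np.linalg.qr(svd_reduce(acts_a, var_kept))
    qb, _ = np.linalg.qr(svd_reduce(acts_b, var_kept))
    # With orthonormal bases Qa and Qb, the singular values of Qa^T Qb
    # are the canonical correlations between the two subspaces.
    rho = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


# Synthetic stand-ins for layer activations (assumption: 200 samples, 8 features).
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(200, 8))
# A noisy linear transform of layer_a, mimicking a closely related layer.
layer_b = layer_a @ rng.normal(size=(8, 8)) + 0.05 * rng.normal(size=(200, 8))
# Independent noise, mimicking a layer that shares no structure with layer_a.
unrelated = rng.normal(size=(200, 8))

print("related layers:  ", round(svcca(layer_a, layer_b), 3))
print("unrelated layers:", round(svcca(layer_a, unrelated), 3))
```

A score close to 1 suggests that two layers encode nearly the same linear subspace, while a lower score suggests that their representations diverge. Computing the score for every pair of layers gives a similarity matrix of the kind a layer-wise study can use to see where representations change with depth.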

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders