Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, and Lei Xie

TL;DR
Whisper-SV adapts the Whisper speech foundation model to speaker verification in low-data scenarios, using a lightweight framework that selects the most speaker-discriminative layers and aggregates their features, reducing EER and minDCF relative to existing methods.
Contribution
This work introduces Whisper-SV, a lightweight adaptation framework with layer selection and multi-layer aggregation modules for effective low-data-resource speaker verification.
Findings
Achieves state-of-the-art results on VoxCeleb1, FFSVC, and IMSV datasets.
Significantly reduces equal error rate (EER) and minimum detection cost function (minDCF) in low-data scenarios.
Demonstrates the effectiveness of multi-layer feature aggregation for speaker verification.
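For readers unfamiliar with the EER metric named in the findings, the following is a minimal sketch of how it can be computed from trial scores. It is not taken from the paper: the function name `compute_eer` and the toy scores are illustrative assumptions, and the sketch simply sweeps every observed score as a threshold and reports the point where the false-accept and false-reject rates are closest.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    target_scores: scores of same-speaker trials (higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy example (made-up scores): one target trial scores below one impostor.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means fewer errors at the operating point where both error types are equally likely; production toolkits usually interpolate between thresholds rather than picking the nearest observed score as this sketch does.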
Abstract
Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k…
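The two ideas named in the abstract, picking the top-k layers and then combining them, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract shown here does not say how each layer's speaker content is quantified, so `layer_scores` is a hypothetical per-layer score supplied by the caller, and the softmax-weighted sum is a simple stand-in for the paper's aggregation module.

```python
import math


def select_top_k_layers(layer_scores, k):
    """Return indices of the k highest-scoring layers, best first.

    layer_scores is a hypothetical per-layer speaker-discriminability
    score (higher = better); how to obtain it is left open here.
    """
    ranked = sorted(range(len(layer_scores)),
                    key=lambda i: layer_scores[i], reverse=True)
    return ranked[:k]


def aggregate_layers(layer_features, selected, logits):
    """Softmax-weighted sum of the selected layers' feature vectors.

    A stand-in for a learned aggregation module; `logits` would be
    trainable parameters in a real model (assumption).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_features[selected[0]])
    return [
        sum(w * layer_features[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]


# Toy example with four layers and 2-dimensional features (made-up values).
scores = [0.1, 0.7, 0.4, 0.9]
features = [[1.0, 1.0], [2.0, 0.0], [0.0, 0.0], [0.0, 4.0]]
top = select_top_k_layers(scores, k=2)
pooled = aggregate_layers(features, top, logits=[0.0, 0.0])
print(top, pooled)
```

With equal logits the weights are uniform, so the output is the plain average of the selected layers; learned logits would let the model emphasise whichever selected layer carries more speaker information.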
Taxonomy
Topics: Speech Recognition and Synthesis
