Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, and Lei Xie

arXiv:2407.10048 · cs.SD · July 16, 2024

TL;DR

Whisper-SV adapts the Whisper speech foundation model to speaker verification in low-data scenarios using a layer selection module and a multi-layer aggregation module, and reports lower EER and minDCF than existing methods on VoxCeleb1, FFSVC, and IMSV.

Contribution

This work introduces Whisper-SV, a lightweight adaptation framework with layer selection and multi-layer aggregation modules for effective low-data-resource speaker verification.

Findings

01 Achieves state-of-the-art results on the VoxCeleb1, FFSVC, and IMSV datasets.

02 Significantly reduces equal error rate (EER) and minimum detection cost function (minDCF) in low-data scenarios.

03 Demonstrates the effectiveness of multi-layer feature aggregation for speaker verification.
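
For readers unfamiliar with the EER metric named in the findings, here is a minimal sketch of how an equal error rate can be approximated from verification scores. It is not taken from the paper. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and the threshold sweep is a simple approximation rather than the interpolated EER that evaluation toolkits usually report.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (assumed): one target trial scores below one non-target trial.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

A lower EER means the system's false acceptance and false rejection rates balance at a smaller error, which is the sense in which the findings describe a reduction.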

Abstract

Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k…
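
The top-k layer selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The per-layer scores are taken as given inputs, because the abstract does not say how they are computed. The softmax-weighted sum is only a stand-in for the paper's multi-layer aggregation module, whose design is not described in the text available here. The names `select_top_k_layers` and `aggregate_layers` are hypothetical.

```python
import math


def select_top_k_layers(layer_scores, k):
    """Return the indices of the k highest-scoring layers, in layer order.

    layer_scores is an assumed input: one speaker-discriminability score per layer.
    """
    ranked = sorted(range(len(layer_scores)),
                    key=lambda i: layer_scores[i], reverse=True)
    return sorted(ranked[:k])


def aggregate_layers(layer_features, selected, weights):
    """Combine the selected layers with a softmax-weighted sum.

    Stand-in only: the paper's aggregation module is not specified in the abstract.
    """
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    dim = len(layer_features[selected[0]])
    out = [0.0] * dim
    for idx, e in zip(selected, exps):
        for d in range(dim):
            out[d] += (e / total) * layer_features[idx][d]
    return out


# Toy example (assumed values): four layers, each with a 2-dimensional feature.
scores = [0.1, 0.5, 0.3, 0.9]
features = [[1.0, 1.0], [2.0, 4.0], [0.0, 0.0], [4.0, 8.0]]
top2 = select_top_k_layers(scores, k=2)
pooled = aggregate_layers(features, top2, weights=[0.0, 0.0])
```

With equal weights the stand-in reduces to a plain average of the selected layers. In a trained system the weights would presumably be learned along with the rest of the adaptor.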

Taxonomy

Topics: Speech Recognition and Synthesis