UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification
Mufan Sang, John H. L. Hansen

TL;DR
This paper introduces UniPET-SPK, a unified parameter-efficient tuning framework for large pre-trained speech models, significantly reducing training costs while improving speaker verification performance across multiple datasets.
Contribution
The study proposes a novel unified framework combining adapter and prompt tuning with a dynamic gating mechanism for speech models, enabling efficient adaptation with minimal parameter updates.
Findings
Outperforms fine-tuning and other PET methods on multiple datasets
Achieves superior speaker verification accuracy with only 5.4% of parameters updated
Demonstrates robustness across diverse speech datasets
Abstract
With excellent generalization ability, SSL speech models have shown impressive performance on various downstream tasks in the pre-training and fine-tuning paradigm. However, as the size of pre-trained models grows, fine-tuning becomes practically unfeasible due to expanding computation and storage requirements and the risk of overfitting. This study explores parameter-efficient tuning (PET) methods for adapting large-scale pre-trained SSL speech models to speaker verification task. Correspondingly, we propose three PET methods: (i)an adapter-tuning method, (ii)a prompt-tuning method, and (iii)a unified framework that effectively incorporates adapter-tuning and prompt-tuning with a dynamically learnable gating mechanism. First, we propose the Inner+Inter Adapter framework, which inserts two types of adapters into pre-trained models, allowing for adaptation of latent features within the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing
MethodsAttention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
