UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification

Mufan Sang; John H. L. Hansen

arXiv:2501.16542 · eess.AS · January 29, 2025

Open Access

TL;DR

This paper introduces UniPET-SPK, a unified parameter-efficient tuning framework for large pre-trained speech models, significantly reducing training costs while improving speaker verification performance across multiple datasets.

Contribution

The study proposes a novel unified framework combining adapter and prompt tuning with a dynamic gating mechanism for speech models, enabling efficient adaptation with minimal parameter updates.

Findings

01. Outperforms fine-tuning and other PET methods on multiple datasets

02. Achieves superior speaker verification accuracy with only 5.4% of parameters updated

03. Demonstrates robustness across diverse speech datasets

Abstract

With excellent generalization ability, self-supervised learning (SSL) speech models have shown impressive performance on various downstream tasks in the pre-training and fine-tuning paradigm. However, as the size of pre-trained models grows, fine-tuning becomes practically infeasible due to expanding computation and storage requirements and the risk of overfitting. This study explores parameter-efficient tuning (PET) methods for adapting large-scale pre-trained SSL speech models to the speaker verification task. Correspondingly, we propose three PET methods: (i) an adapter-tuning method, (ii) a prompt-tuning method, and (iii) a unified framework that effectively incorporates adapter-tuning and prompt-tuning with a dynamically learnable gating mechanism. First, we propose the Inner+Inter Adapter framework, which inserts two types of adapters into pre-trained models, allowing for adaptation of latent features within the…
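The abstract is cut off before it details the adapters, the prompts, or the gate, so the following is only an illustrative sketch of the three general ideas it names: a bottleneck adapter, prepended prompt vectors, and a learnable gate. It is not the authors' implementation. Every name here (`GatedAdapter`, `prepend_prompts`, `trainable_fraction`), the scalar sigmoid gate, the zero-initialised up-projection, and all sizes are assumptions chosen for illustration.

```python
import math
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GatedAdapter:
    """Hypothetical bottleneck adapter with a scalar learnable gate.

    Computes h + sigmoid(gate_logit) * up(relu(down(h))). The frozen
    backbone output h passes through the residual path unchanged.
    """

    def __init__(self, dim, bottleneck, gate_logit=0.0, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random init (assumption).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero init (assumption) so the
        # adapter starts out as an identity mapping.
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.gate_logit = gate_logit

    def num_params(self):
        """Count the trainable scalars: both projections plus the gate."""
        dim, bottleneck = len(self.up), len(self.down)
        return 2 * dim * bottleneck + 1

    def __call__(self, h):
        hidden = [max(0.0, v) for v in matvec(self.down, h)]  # ReLU
        delta = matvec(self.up, hidden)
        gate = sigmoid(self.gate_logit)
        return [hv + gate * dv for hv, dv in zip(h, delta)]


def prepend_prompts(prompts, frames):
    """Prompt tuning in its simplest form: learnable vectors are placed
    in front of the frame sequence fed to a frozen layer."""
    return list(prompts) + list(frames)


def trainable_fraction(trainable, frozen):
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


if __name__ == "__main__":
    adapter = GatedAdapter(dim=8, bottleneck=2)
    frame = [1.0] * 8
    print(adapter(frame) == frame)  # identity at initialisation
    print(adapter.num_params())
    print(len(prepend_prompts([[0.0] * 8] * 2, [frame] * 3)))
```

In a sketch like this, only the adapter weights, the prompt vectors, and the gate would be trained while the pre-trained backbone stays frozen, which is what keeps the updated-parameter share small; `trainable_fraction` simply makes that ratio explicit for whatever counts a given setup has.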

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer