Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

Jiangyu Han; Federico Landini; Johan Rohdin; Anna Silnova; Mireia Diez; Jan Cernocky; Lukas Burget

arXiv:2505.24111 · eess.AS · June 2, 2025 · Interspeech


TL;DR

This paper presents a method to compress self-supervised speech models for speaker diarization by fine-tuning before structured pruning, achieving significant size reduction and speedup without performance loss.

Contribution

The paper proposes fine-tuning SSL models before structured pruning, which enables high compression ratios while maintaining accuracy.

Findings

01

Up to 80% of the parameters of WavLM Base+ and WavLM Large are removed without performance degradation.

02

After pruning, inference on a single GPU is 4.0x faster for WavLM Base+ and 2.6x faster for WavLM Large.

03

Validated on the far-field single-channel AMI, AISHELL-4, and AliMeeting datasets.

Abstract

Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource-constrained scenarios. Previous studies have explored compression techniques, but usually at the price of degraded performance at high pruning ratios. In this work, we propose to compress SSL models through structured pruning by introducing knowledge distillation. Unlike existing works, we emphasize the importance of fine-tuning SSL models before pruning. Experiments on far-field single-channel AMI, AISHELL-4, and AliMeeting datasets show that our method can remove redundant parameters of WavLM Base+ and WavLM Large by up to 80% without any performance degradation. After pruning, the inference speeds on a single GPU for the Base+ and Large models are 4.0 and 2.6 times faster, respectively. Our…
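To make the term "structured pruning" concrete, the following is a minimal sketch of what removing 80% of a layer's parameters can look like when whole output neurons (rows of a weight matrix) are dropped. It is not the paper's method: the abstract only states that pruning is structured and combined with knowledge distillation, so the row-norm selection criterion, the toy weight matrix, and the helper names `prune_rows` and `count_params` are all assumptions made for illustration.

```python
import math


def prune_rows(weight, keep_ratio):
    """Drop whole rows (output neurons) of a linear layer's weight matrix.

    Assumption: rows are ranked by L2 norm. The paper's actual selection
    criterion is not given in the abstract and may differ.
    """
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    # Pick the n_keep rows with the largest norms, then restore original order.
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [weight[i] for i in keep], keep


def count_params(weight):
    """Number of scalar weights in a (list-of-rows) matrix."""
    return sum(len(row) for row in weight)


# Hypothetical 5x4 layer: five output neurons, four inputs each.
weight = [
    [0.9, -0.8, 0.7, 0.6],
    [0.01, 0.02, -0.01, 0.0],
    [0.5, 0.4, -0.6, 0.3],
    [0.0, 0.01, 0.0, -0.02],
    [0.02, -0.01, 0.01, 0.0],
]

# Keeping 20% of the rows corresponds to an 80% parameter reduction.
pruned, kept = prune_rows(weight, keep_ratio=0.2)
reduction = 1 - count_params(pruned) / count_params(weight)

print("kept rows:", kept)
print("pruned shape:", len(pruned), "x", len(pruned[0]))
print("parameter reduction:", round(reduction, 2))
```

Because entire rows are removed, the pruned layer is a smaller dense matrix instead of a sparse one of the original size, which is the general reason structured pruning can translate into faster inference on standard GPU kernels.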


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Pruning