MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events

Xiaoyu Yang; Qiujia Li; Chao Zhang; Phil Woodland

arXiv:2409.17010 · eess.AS · February 21, 2025

Open Access

TL;DR

This paper introduces MT2KD, a two-stage multi-task learning framework that creates a versatile encoder capable of handling speech recognition, audio tagging, and speaker verification with high accuracy and efficiency.

Contribution

The paper presents a novel two-stage training approach combining multi-teacher knowledge distillation and supervised fine-tuning to develop a general-purpose speech and audio encoder.

Findings

01 Significantly outperforms baseline multi-task models.

02 Achieves competitive results on ASR, AT, and SV tasks.

03 Uses only 66M parameters for a versatile encoder.

Abstract

With advances in deep learning, the performance of end-to-end (E2E) single-task models for speech and audio processing has been constantly improving. However, it is still challenging to build a general-purpose model with high performance on multiple tasks, since different speech and audio processing tasks usually require different training data, input features, or model architectures to achieve optimal performance. In this work, MT2KD, a novel two-stage multi-task learning framework, is proposed to build a general-purpose speech and audio encoder that jointly performs three fundamental tasks: automatic speech recognition (ASR), audio tagging (AT), and speaker verification (SV). In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three single-task high-performance teacher encoders into a single student encoder using the same…
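The first-stage idea in the abstract, distilling three single-task teacher encoders into one student encoder, can be illustrated with a minimal sketch. The abstract shown here is truncated and does not state the distillation loss, so everything below is an assumption made for illustration: the function names (`project`, `mse`, `multi_teacher_kd_loss`), the per-teacher linear projection heads, the mean-squared-error objective, and the task weights are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # one row per output dimension


def project(weights: Matrix, features: Vector) -> Vector:
    """Apply a bias-free linear projection to a student feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_kd_loss(
    student_features: Vector,
    projections: Dict[str, Matrix],
    teacher_features: Dict[str, Vector],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-teacher feature-matching losses.

    ASSUMPTION: each teacher (e.g. "asr", "at", "sv") gets its own
    projection head that maps the shared student feature into that
    teacher's feature space, and MSE is used as the matching loss.
    The source does not confirm either choice.
    """
    total = 0.0
    for task, target in teacher_features.items():
        predicted = project(projections[task], student_features)
        total += task_weights.get(task, 1.0) * mse(predicted, target)
    return total
```

In this sketch a single student feature vector is shared across all three tasks, and only the small projection heads are task-specific. That is one simple way to let teachers with different feature dimensions supervise the same encoder; the actual framework may align the feature spaces differently.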

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Knowledge Distillation · ALIGN