Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

Manthan Thakker; Sefik Emre Eskimez; Takuya Yoshioka; Huaming Wang

arXiv:2204.00771·eess.AS·April 5, 2022

TL;DR

This paper introduces E3Net, a fast end-to-end speech enhancement model, and employs knowledge distillation and multi-task learning to create smaller, faster models that maintain high speech and transcription quality.

Contribution

The paper presents a novel end-to-end architecture (E3Net) for real-time personalized speech enhancement and demonstrates effective knowledge distillation techniques to produce compact, efficient models.

Findings

01

E3Net is 3 times faster than a baseline STFT-based model.

02

Knowledge distillation yields compressed student models that are 2-4 times faster with comparable quality.

03

Combining KD and MTL improves ASR and TSOS metrics without quality loss.

Abstract

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3 \times$ faster than a baseline STFT-based model. Besides, we use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are…
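As a rough illustration of how KD can be combined with an ASR loss when no clean reference signal is available, the sketch below uses the teacher's enhanced output as the training target for the student and adds a weighted ASR term. It is only a minimal sketch under stated assumptions: the L1 distance, the function names `distillation_loss` and `kd_mtl_loss`, and the `asr_weight` parameter are all hypothetical and are not taken from the paper, and the ASR loss is assumed to be computed elsewhere by a separate recognizer.

```python
import numpy as np

def distillation_loss(student_out, teacher_out):
    """Mean absolute error between student and teacher enhanced waveforms.

    Assumption: an L1 distance stands in for whatever distillation
    objective the paper actually uses.
    """
    student_out = np.asarray(student_out, dtype=np.float64)
    teacher_out = np.asarray(teacher_out, dtype=np.float64)
    if student_out.shape != teacher_out.shape:
        raise ValueError("student and teacher outputs must have the same shape")
    return float(np.mean(np.abs(student_out - teacher_out)))

def kd_mtl_loss(student_out, teacher_out, asr_loss, asr_weight=0.1):
    """Weighted sum of the distillation term and an externally computed ASR loss.

    Assumption: asr_weight is a hypothetical hyperparameter; no clean
    reference signal is needed because the teacher output is the target.
    """
    return distillation_loss(student_out, teacher_out) + asr_weight * float(asr_loss)
```

For example, `kd_mtl_loss(student_wave, teacher_wave, asr_loss=2.0, asr_weight=0.5)` would add half of the ASR loss to the waveform distance between the two models' outputs, where `student_wave` and `teacher_wave` are placeholder arrays of equal shape.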

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation