OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

TL;DR
OSSEM is a meta-learning-based, one-shot speaker-adaptive speech enhancement system. It adapts to an individual speaker from a single enrolled utterance, runs causally in real time, and performs competitively with state-of-the-art causal SE systems.
Contribution
The paper presents a novel meta-learning approach for speaker adaptation in speech enhancement, enabling effective one-shot adaptation with a causal, real-time system.
Findings
Effective speaker adaptation with only one utterance.
Competitive performance against state-of-the-art causal SE systems.
Real-time, causal speech enhancement achieved.
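The one-utterance adaptation finding rests on the evaluation protocol described in the abstract: one test utterance is held out for adaptation and the rest are used for scoring. The sketch below illustrates such a split. It makes assumptions the summary does not confirm: that the hold-out is done per speaker, that the first utterance (in sorted order) is the one held out, and that utterance IDs follow the `speaker_index` pattern. The function name `one_shot_split` and the example IDs are hypothetical.

```python
from collections import defaultdict


def one_shot_split(utterance_ids):
    """Hold out one utterance per speaker for adaptation; keep the rest for testing.

    Assumption: IDs look like "<speaker>_<index>", and the first utterance in
    sorted order is the enrollment utterance. The paper may choose differently.
    """
    by_speaker = defaultdict(list)
    for utt in sorted(utterance_ids):
        speaker = utt.split("_")[0]
        by_speaker[speaker].append(utt)

    adapt, test = {}, {}
    for speaker, utts in by_speaker.items():
        adapt[speaker] = utts[0]   # single enrollment utterance (one-shot)
        test[speaker] = utts[1:]   # remaining utterances are scored
    return adapt, test


# Illustrative IDs only, not the actual file list of the modified dataset.
utts = ["p232_001", "p232_002", "p232_003", "p257_001", "p257_002"]
adapt, test = one_shot_split(utts)
```

Here `adapt` maps each speaker to its single enrollment utterance and `test` maps each speaker to the utterances left for evaluation, so no utterance is used for both.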
Abstract
Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted…
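As a rough illustration of the speaker-specific masking idea in the abstract, the sketch below maps a speaker embedding to a per-frequency mask and applies it to a noisy feature. It is a minimal sketch under stated assumptions, not the authors' SSM network: the single linear layer with a sigmoid, the 192-dimensional embedding, the 257 frequency bins, and the random weights are all illustrative choices, and the embedding here is random rather than extracted with ECAPA-TDNN.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def speaker_mask(embedding, weight, bias):
    """Map a speaker embedding (D,) to a per-frequency mask (F,) with values in (0, 1).

    Assumption: one linear layer plus sigmoid stands in for the SSM network.
    """
    return sigmoid(embedding @ weight + bias)


def apply_speaker_mask(noisy_feat, embedding, weight, bias):
    """Scale every frame of noisy_feat (T, F) by the speaker-dependent mask."""
    mask = speaker_mask(embedding, weight, bias)
    return noisy_feat * mask  # mask (F,) broadcasts over the time axis


rng = np.random.default_rng(0)
D, F, T = 192, 257, 100                      # assumed sizes, for illustration
embedding = rng.standard_normal(D)           # stand-in for an enrolled speaker embedding
weight = rng.standard_normal((D, F)) * 0.01  # untrained placeholder parameters
bias = np.zeros(F)
noisy_feat = np.abs(rng.standard_normal((T, F)))  # stand-in magnitude spectrogram

masked = apply_speaker_mask(noisy_feat, embedding, weight, bias)
```

In this sketch the mask depends only on the enrolled embedding, not on future frames, so applying it does not break causality of whatever enhancement network consumes `masked`.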
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
