Efficient Personalized Speech Enhancement through Self-Supervised Learning
Aswin Sivaraman, Minje Kim

TL;DR
This paper introduces self-supervised learning techniques for personalized speech enhancement that effectively adapt to individual speakers using minimal or no clean target data, reducing data needs and model size.
Contribution
It proposes novel self-supervised methods that enable zero-shot personalization of speech enhancement models with no clean speech from the target speaker, and few-shot personalization with only a short duration of it.
Findings
Self-supervised models achieve effective zero-shot personalization.
Personalized models require less clean target-speaker speech and fewer parameters.
Improved data efficiency and model compression are demonstrated.
Abstract
This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their enhancement function towards a particular speaker's voice, expecting to solve a narrower problem. Hence, specialists are capable of achieving more optimal performance in addition to reducing computational complexity. However, naive personalization methods can require clean speech from the target user, which is inconvenient to acquire, e.g., due to subpar recording conditions. To this end, we pose personalization as either a zero-shot task, in which no additional clean speech of the target speaker is used for training, or a few-shot learning task, in which the goal is to minimize the duration of the clean speech used for transfer learning. With this paper,…
